
Data Engineering on Databricks

1h 27m 29s
English
Paid

Databricks is one of the most popular platforms for data processing with Apache Spark and a cornerstone of the modern lakehouse architecture. In this course, you will acquire the skills needed for a confident start with Databricks, from understanding the basics of the platform to building your own data pipelines and integrating BI tools.

You'll delve into how Databricks works, why it's beneficial, how to create your own notebooks and configure a compute cluster, and how to get acquainted with Databricks SQL Warehouse.

Installation and Data Preparation

Before engaging in practical activities, you will set up Databricks on AWS, create an S3 bucket for data storage, and configure a workspace. You'll also explore the AWS CloudFormation template that Databricks uses to deploy the required infrastructure automatically.

You'll review the created cluster and acquaint yourself with the dataset that will underpin your ETL process.

Practice: Data Processing

You'll learn two ways to load data into Databricks: uploading files directly through the UI, or storing them in S3 and reading them from there. Additionally, you'll learn how to set up code repositories, either by connecting a GitHub repository or by creating one manually within Databricks.
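As a rough sketch of those two loading paths, a notebook cell might read an uploaded file from DBFS or pull the same file straight from S3. The bucket name, file name, and the `/FileStore/tables` upload location below are illustrative assumptions, not values from the course:

```python
def ui_upload_path(filename: str) -> str:
    """Files uploaded through the Databricks UI typically land under
    /FileStore/tables on DBFS (exact layout may differ per workspace)."""
    return f"dbfs:/FileStore/tables/{filename}"

def s3_path(bucket: str, key: str) -> str:
    """Direct S3 path, readable once the cluster has IAM access to the bucket."""
    return f"s3a://{bucket}/{key}"

def load_csv(spark, path: str):
    """Read a CSV with header and schema inference.
    `spark` is the SparkSession every Databricks notebook provides."""
    return (spark.read
                 .option("header", "true")
                 .option("inferSchema", "true")
                 .csv(path))

# In a notebook you might call, for example:
# df = load_csv(spark, ui_upload_path("sales.csv"))          # UI upload
# df = load_csv(spark, s3_path("my-demo-bucket", "raw/sales.csv"))  # via S3
```

Both paths end in the same `spark.read` call; only the URI scheme (`dbfs:` vs `s3a:`) changes.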

Throughout the project, you'll accomplish two primary tasks:

  • ETL Data Processing: Execute the pipeline, perform transformations, create tables, and save them in Databricks.
  • Data Visualization: Conduct analysis with Spark SQL in a separate notebook and develop visualizations.

You will also understand the data storage mechanisms within Databricks.

Data Warehouse and External Integrations

In the final stage, you will connect Power BI to Databricks and experiment with two integration methods: via a compute cluster and through SQL Warehouse. This experience will teach you how to integrate Databricks with external analytics tools effectively.
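Power BI connects using a Server Hostname and an HTTP Path copied from the cluster's or warehouse's connection details. As a rough sketch, the same SQL Warehouse endpoint can be queried from Python with the `databricks-sql-connector` package; the hostname, warehouse ID, token, and table name below are placeholders:

```python
def warehouse_http_path(warehouse_id: str) -> str:
    """HTTP Path exposed by a SQL Warehouse -- the same value Power BI asks for."""
    return f"/sql/1.0/warehouses/{warehouse_id}"

def run_query(hostname: str, warehouse_id: str, token: str, query: str):
    """Execute a query against a SQL Warehouse.
    Requires `pip install databricks-sql-connector`."""
    from databricks import sql  # imported lazily; the helper above works without it

    with sql.connect(
        server_hostname=hostname,
        http_path=warehouse_http_path(warehouse_id),
        access_token=token,
    ) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()

# Example (placeholder values):
# run_query("dbc-xxxx.cloud.databricks.com", "abc123", "<personal-access-token>",
#           "SELECT COUNT(*) FROM default.sales")
```

Whether the client is Power BI or this connector, the warehouse sees the same kind of SQL request, which is why the two integration methods in this section feel so similar.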

Recommendations Before Starting

Before commencing this course, it is recommended to complete the "Apache Spark Basics" course. These foundational skills are crucial for effective work in Databricks.

Requirements

  • AWS account
  • Databricks account
  • Basic knowledge of Spark (equivalent to the "Apache Spark Basics" course)
  • Willingness to incur minimal AWS costs (largely covered by the free tier)

About the Author: Andreas Kretz


I am a senior data engineer and trainer, a tech enthusiast, and a father. For more than ten years, I have been passionate about Data Engineering. Initially, I became a self-taught data engineer and then led a team of data engineers at a large company. When I realized the great demand for education in this field, I followed my passion and founded my own Data Engineering Academy. Since then, I have helped over 2,000 students achieve their goals.

Watch Online: 19 Lessons

All Course Lessons (19)
  1. Introduction (Demo) - 02:56
  2. Why Databricks - 04:05
  3. Pricing Explained - 06:51
  4. Create Databricks Account & Workspace - 07:09
  5. AWS Resources Created by Databricks - 04:03
  6. Intro to Databricks UI & Compute Cluster - 06:05
  7. The Dataset - 02:46
  8. Goals: ETL & Visualization Pipeline Explained - 02:16
  9. Import Data in Databricks UI - 04:44
  10. Databricks Data in S3 - 02:09
  11. Creating Code Repos - 04:35
  12. Running Our ETL Job - 09:26
  13. Explore Data Tables in AWS Folders - 02:16
  14. Explore Data with Databricks Notebook, Part 1 - 05:55
  15. Explore Data with Databricks Notebook, Part 2 - 06:45
  16. Compute Cluster vs Databricks SQL Warehouse - 04:11
  17. Power BI Queries through Compute Cluster - 04:21
  18. Power BI Queries through Databricks SQL Warehouse - 04:44
  19. Conclusion - 02:12