Data Engineering on Databricks
Databricks is one of the most popular platforms for data processing using Apache Spark and for creating modern data warehouses (Lakehouse). In this course, you will learn everything you need for a confident start with Databricks, from the basics of the platform to creating your own pipelines and connecting BI tools.
You will learn how Databricks works and why to use it, create your own notebooks, set up a compute cluster, and get acquainted with Databricks SQL Warehouse.
1. Installation and Data Preparation
Before getting started with practical work, you will set up Databricks on AWS, create an S3 bucket for data storage, and set up a workspace. You will also examine the AWS CloudFormation template that Databricks uses to understand how the infrastructure is automatically deployed.
You will review the created cluster and become familiar with the dataset on which you will build your ETL process.
2. Practice: Data Processing
You will learn two ways to load data into Databricks: importing it directly through the Databricks UI, or placing it in S3 and then integrating it with Databricks. You will also learn how to create code repositories, again in two ways: by connecting a GitHub repository or by creating a repository manually in Databricks.
During the project, you will complete two key tasks:
- ETL Data Processing: run the pipeline, perform transformation, create tables, and save them in Databricks.
- Data Visualization: perform analysis with Spark SQL in a separate notebook and create visualizations.
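As a rough illustration of what these two tasks can look like in a notebook, here is a minimal PySpark sketch. It is not the course's actual code: the function names `clean_column_name`, `run_etl`, and `build_summary_query`, the `ingested_at` column, and every path, table, and column name you pass in are hypothetical. It assumes a CSV source with a header row and an active `spark` session, which Databricks notebooks provide.

```python
import re


def clean_column_name(name: str) -> str:
    """Turn a raw CSV header into a lowercase snake_case column name.

    By default, Delta tables reject column names that contain spaces
    and some punctuation, so headers are normalized before saving.
    """
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def run_etl(spark, source_path: str, table_name: str):
    """Read a CSV file, apply simple transformations, and save a table.

    `spark` is an existing SparkSession. `source_path` and `table_name`
    are placeholders chosen by the caller (hypothetical values).
    """
    # Imported here so the helper functions work without PySpark installed.
    from pyspark.sql import functions as F

    # Extract: read the raw file and let Spark infer column types.
    df = spark.read.csv(source_path, header=True, inferSchema=True)

    # Transform: normalize headers, drop duplicate and fully empty rows,
    # and record when each row was loaded (illustrative column).
    df = df.toDF(*[clean_column_name(c) for c in df.columns])
    df = df.dropDuplicates().dropna(how="all")
    df = df.withColumn("ingested_at", F.current_timestamp())

    # Load: save the result as a managed table, replacing earlier runs.
    df.write.mode("overwrite").saveAsTable(table_name)
    return df


def build_summary_query(table_name: str, group_col: str) -> str:
    """Build a Spark SQL query that counts rows per value of `group_col`."""
    return (
        f"SELECT {group_col}, COUNT(*) AS row_count "
        f"FROM {table_name} GROUP BY {group_col} ORDER BY row_count DESC"
    )
```

In a separate analysis notebook, you could run `spark.sql(build_summary_query(table_name, group_col))` against the saved table and pass the result to the notebook's `display` function to chart it.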
You will also learn how data is stored within Databricks.
3. Data Warehouse and External Integrations
Finally, you will connect Power BI to Databricks and try both integration methods: through a compute cluster and through SQL Warehouse. This way, you will learn how to integrate Databricks with external analytics tools.
Recommendations Before Starting
Before starting this course, it is recommended to complete the "Apache Spark Basics" course. With these foundational skills, you will be able to work effectively in Databricks.
Requirements:
- AWS account
- Databricks account
- Knowledge of basic Spark (at the level of the "Spark Fundamentals" course)
- A small budget for AWS costs (these stay minimal, especially within the free tier)
Watch Online Data Engineering on Databricks
# | Title | Duration |
---|---|---|
1 | Introduction | 02:56 |
2 | Why Databricks | 04:05 |
3 | Pricing explained | 06:51 |
4 | Create Databricks Account & Workspace | 07:09 |
5 | AWS Resources created by Databricks | 04:03 |
6 | Intro Databricks UI & Compute Cluster | 06:05 |
7 | The Dataset | 02:46 |
8 | Goals ETL & Visualization pipeline explained | 02:16 |
9 | Import Data in Databricks UI | 04:44 |
10 | Databricks Data in S3 | 02:09 |
11 | Creating code Repos | 04:35 |
12 | Running our ETL job | 09:26 |
13 | Explore Data Tables in AWS folders | 02:16 |
14 | Explore data with Databricks notebook 1 | 05:55 |
15 | Explore data with Databricks notebook 2 | 06:45 |
16 | Compute Cluster vs Databricks SQL Warehouse | 04:11 |
17 | Power BI queries through compute cluster | 04:21 |
18 | Power BI queries through Databricks SQL Warehouse | 04:44 |
19 | Conclusion | 02:12 |
Similar courses to Data Engineering on Databricks

- MongoDB Fundamentals (Andreas Kretz)
- Platform & Pipeline Security (Andreas Kretz)
- DS4B 101-P: Python for Data Science Automation (Business Science University)
- Python for Data Engineers (Andreas Kretz)
- Apache Airflow Workflow Orchestration (Andreas Kretz)
- Dimensional Data Modeling (Eka Ponkratova)
- Schema Design Data Stores (Andreas Kretz)
- 2022 Python for Machine Learning & Data Science Masterclass (Udemy)
- Case Study in Product Data Science (LunarTech)
