Data Engineering on Databricks

1h 27m 29s
English
Paid

Databricks is one of the most popular platforms for data processing using Apache Spark and is essential for creating modern data warehouses (Lakehouse). In this course, you will acquire the skills necessary for a confident start with Databricks, from understanding the basics of the platform to developing your own data pipelines and integrating BI tools.

You'll delve into how Databricks functions, why it's beneficial, how to create your notebooks, configure a computing cluster, and familiarize yourself with Databricks SQL Warehouse.

Installation and Data Preparation

Before engaging in practical activities, you will set up Databricks on AWS, create an S3 bucket for data storage, and configure a workspace. You'll also explore the AWS CloudFormation template that facilitates automatic infrastructure deployment via Databricks.

You'll review the created cluster and acquaint yourself with the dataset that will underpin your ETL process.

Practice: Data Processing

Learn two methods to load data into Databricks: importing it directly through the Databricks UI, or storing it in S3 and then integrating it with Databricks. Additionally, you'll learn how to establish code repositories, either by connecting a GitHub repository or by creating one manually within Databricks.

Throughout the project, you'll accomplish two primary tasks:

  • ETL Data Processing: Execute the pipeline, perform transformations, create tables, and save them in Databricks.
  • Data Visualization: Conduct analysis with Spark SQL in a separate notebook and develop visualizations.
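
The two tasks above can be sketched in miniature. This is only an illustrative outline, not the course's pipeline: plain Python and the built-in sqlite3 module stand in for Spark and Databricks tables so the example runs anywhere. The sample rows are made up, the column names are assumed to follow the linked e-commerce dataset, and the `sales` table and all function names are hypothetical.

```python
import csv
import io
import sqlite3

# Made-up sample rows; column names are assumed to match the linked dataset.
SAMPLE_CSV = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
A001,S1,Mug,2,12/1/2010 8:26,2.50,17850,United Kingdom
A002,S2,Lamp,1,12/1/2010 8:28,10.00,,United Kingdom
A003,S1,Mug,4,12/1/2010 9:00,2.50,12583,France
A004,S3,Clock,-1,12/1/2010 9:05,8.00,12583,France
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no customer or a non-positive quantity,
    cast numeric fields, and add a line total."""
    cleaned = []
    for row in rows:
        if not row["CustomerID"]:
            continue
        quantity = int(row["Quantity"])
        if quantity <= 0:
            continue
        unit_price = float(row["UnitPrice"])
        cleaned.append(
            (row["InvoiceNo"], row["Country"], quantity, unit_price,
             quantity * unit_price)
        )
    return cleaned


def load(conn, rows):
    """Load: create the (hypothetical) sales table and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "invoice_no TEXT, country TEXT, quantity INTEGER, "
        "unit_price REAL, line_total REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def revenue_by_country(conn):
    """Analysis: the kind of aggregate a SQL notebook cell might run."""
    cursor = conn.execute(
        "SELECT country, SUM(line_total) FROM sales "
        "GROUP BY country ORDER BY country"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SAMPLE_CSV)))
result = revenue_by_country(conn)
print(result)
```

In Databricks the same shape would typically be expressed with Spark DataFrames and saved tables, with the aggregate query run from a separate notebook that feeds the visualizations.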

You will also understand the data storage mechanisms within Databricks.

Data Warehouse and External Integrations

In the final stage, you will connect Power BI to Databricks and experiment with two integration methods: via a compute cluster and through SQL Warehouse. This experience will teach you how to integrate Databricks with external analytics tools effectively.

Recommendations Before Starting

Before commencing this course, it is recommended to complete the "Apache Spark Basics" course. These foundational skills are crucial for effective work in Databricks.

Requirements

  • AWS account
  • Databricks account
  • Basic knowledge of Spark (equivalent to the "Spark Fundamentals" course)
  • A budget for minimal AWS costs (lower still if your account is within the free tier)

Additional

Link to the dataset: https://www.kaggle.com/datasets/carrie1/ecommerce-data

Link to the GitHub repo: https://github.com/team-data-science/databricks

About the Author: Andreas Kretz

Andreas Kretz is a German data engineer and one of the most widely followed independent voices on data engineering as a career discipline. He runs the Plumbers of Data Science brand and has been publishing tutorial material continuously since the field consolidated around the modern lakehouse stack (Spark, Kafka, Snowflake, Databricks, Airflow).

His CourseFlix listing is the largest single-author catalog from this source: over thirty courses spanning data-pipeline construction, streaming architectures, the cloud-native data stack on AWS, Azure, and GCP, the Python and Scala tooling that dominates the field, and the soft-skills and career side of breaking into data engineering. The material is paid and aimed at engineers transitioning into data work or at working data engineers picking up specific tools.

All Course Lessons (19)

#   Lesson Title                                        Duration  Access
1   Introduction                                        02:56     Demo
2   Why Databricks                                      04:05
3   Pricing explained                                   06:51
4   Create Databricks Account & Workspace               07:09
5   AWS Resources created by Databricks                 04:03
6   Intro Databricks UI & Compute Cluster               06:05
7   The Dataset                                         02:46
8   Goals ETL & Visualization pipeline explained        02:16
9   Import Data in Databricks UI                        04:44
10  Databricks Data in S3                               02:09
11  Creating code Repos                                 04:35
12  Running our ETL job                                 09:26
13  Explore Data Tables in AWS folders                  02:16
14  Explore data with databricks notebook 1             05:55
15  Explore data with databricks notebook 2             06:45
16  Compute Cluster vs Databricks SQL Warehouse         04:11
17  Power BI queries through compute cluster            04:21
18  Power BI queries through Databricks SQL Warehouse   04:44
19  Conclusion                                          02:12

Frequently asked questions

What prerequisites are needed before enrolling in this course?
Prospective students should have a basic understanding of data processing concepts and familiarity with cloud computing, specifically AWS, as the course involves setting up Databricks on AWS. Basic knowledge of Apache Spark is listed as a requirement, and completing the "Apache Spark Basics" course beforehand is recommended. Basic programming skills, particularly in SQL, will also help in comprehending the ETL and data visualization exercises.
What projects will I work on during the course?
During the course, you will engage in a project that involves setting up a Databricks environment on AWS, creating an S3 bucket for data storage, and configuring a workspace. You will execute an ETL data processing pipeline to perform data transformations, create tables, and save results. Additionally, you'll analyze data using Spark SQL and develop visualizations in a separate notebook.
Who is the target audience for this course?
This course is targeted at data engineers and analysts who are interested in leveraging Databricks for data processing and visualization. It's suitable for professionals who want to enhance their skills in building modern data warehouses, specifically using the Lakehouse architecture. Individuals looking to integrate BI tools, like Power BI, with Databricks will also find this course valuable.
How does this course compare in depth and scope to similar courses?
The course provides a detailed introduction to Databricks, focusing on practical implementation within AWS. It covers setting up a computing cluster, configuring workspaces, and integrating with BI tools. Unlike some introductory courses, it includes hands-on projects that guide students through the entire process of data pipeline creation and data visualization. The integration of GitHub for code repository management also adds a layer of practical application.
What tools and platforms are specifically covered in this course?
Specific tools and platforms covered in this course include Databricks for data processing, AWS for cloud infrastructure setup, and S3 for data storage. The course also includes practical exercises with Spark SQL for data analysis and visualization. Additionally, it explores the integration of Power BI with Databricks SQL Warehouse for executing queries and generating insights.
What is not covered in this course?
The course does not cover advanced machine learning algorithms or detailed big data architecture design principles. It also does not delve into the specifics of other cloud service providers beyond AWS. While Power BI integration is discussed, the course does not provide a comprehensive guide on Power BI itself or other BI tools beyond basic query execution.
What is the expected time commitment for completing the course?
The course comprises 19 lessons with a total runtime of 1h 27m 29s. Beyond the video time, students should allocate additional time for setting up the environment, engaging in hands-on projects, and practicing the integration of tools like Power BI. Depending on prior experience, the practical components can take considerably longer than the lessons themselves to work through and fully grasp.