Databricks is one of the most popular platforms for data processing with Apache Spark and a key tool for building modern lakehouse architectures. In this course, you will acquire the skills needed for a confident start with Databricks, from understanding the basics of the platform to developing your own data pipelines and integrating BI tools.
You'll learn how Databricks works and why it's useful, create your own notebooks, configure a compute cluster, and get acquainted with Databricks SQL Warehouse.
Installation and Data Preparation
Before the practical work, you will set up Databricks on AWS, create an S3 bucket for data storage, and configure a workspace. You'll also explore the AWS CloudFormation template that Databricks uses to deploy the required AWS infrastructure automatically.
You'll review the created cluster and acquaint yourself with the dataset that will underpin your ETL process.
Practice: Data Processing
You'll learn two ways to load data into Databricks: uploading it directly or staging it in S3 and reading it from there. You'll also learn how to set up code repositories, either by connecting an existing GitHub repository or by creating one manually within Databricks.
Throughout the project, you'll accomplish two primary tasks:
- ETL Data Processing: Execute the pipeline, perform transformations, create tables, and save them in Databricks.
- Data Visualization: Conduct analysis with Spark SQL in a separate notebook and develop visualizations.
You will also understand the data storage mechanisms within Databricks.
Data Warehouse and External Integrations
In the final stage, you will connect Power BI to Databricks and experiment with two integration methods: via a compute cluster and through SQL Warehouse. This experience will teach you how to integrate Databricks with external analytics tools effectively.
Recommendations Before Starting
Before commencing this course, it is recommended to complete the "Apache Spark Basics" course. These foundational skills are crucial for effective work in Databricks.
Requirements
- AWS account
- Databricks account
- Basic knowledge of Spark (equivalent to the "Apache Spark Basics" course)
- Willingness to incur minimal AWS costs (much of the work fits within the free tier)