Learning Apache Spark
In this course, you will learn about Spark's architecture and the basic principles of how it operates. You will practice transformations and actions in Spark, working in Jupyter Notebooks within a Docker environment. You will also be introduced to DataFrames, Spark SQL, and RDDs and learn how to use them to process structured and unstructured data. By the end of the course, you will be ready to write your own Spark jobs and build pipelines from them.
Basics of Spark
You will learn why Spark is beneficial and how vertical scaling differs from horizontal scaling. You will also learn which types of data Spark can work with and where it can be deployed.
You will delve into the architecture: the driver, executors, context, cluster manager, and cluster types. You will also study the differences between the client and cluster deployment modes.
Data and Development Environment
You will get acquainted with the tools we will use in the course. You will find out which dataset is chosen and how to set up the working environment: install Docker and Jupyter Notebook.
Spark Coding Basics
Before the practice, you will cover the key concepts: what RDDs and DataFrames are and how each behaves with different types of data.
You will understand the difference between transformations and actions, how they interact with data and each other, and learn about the most commonly used types of operations.
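As a rough illustration of the split between transformations and actions, here is a toy sketch in plain Python. It is not Spark code and does not use the Spark API: `LazyDataset`, `traced_double`, and the sample numbers are hypothetical names made up for this illustration. It mimics one behavior only: in Spark, transformations such as `map` and `filter` describe work to be done, and nothing is computed until an action such as `collect` or `count` is called.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    It records a plan of steps and runs the plan only when an action is called.
    """

    def __init__(self, source: Iterable[Any], steps: Optional[List[Step]] = None):
        self._source = source
        self._steps: List[Step] = list(steps or [])

    # Transformations: return a new dataset with a longer plan, do no work.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [("filter", pred)])

    def _run(self) -> Iterator[Any]:
        # Chain the recorded steps over the source; still lazy until consumed.
        it: Iterator[Any] = iter(self._source)
        for kind, fn in self._steps:
            it = map(fn, it) if kind == "map" else filter(fn, it)
        return it

    # Actions: consume the plan and return a concrete result.
    def collect(self) -> List[Any]:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


calls: List[int] = []  # records every element the map function actually sees


def traced_double(x: int) -> int:
    calls.append(x)
    return x * 2


numbers = LazyDataset(range(1, 6))
plan = numbers.map(traced_double).filter(lambda x: x > 4)  # transformations only

calls_before_action = len(calls)  # no element has been processed yet
result = plan.collect()           # the action triggers the whole plan
calls_after_action = len(calls)   # every source element has now been mapped

print(calls_before_action, result, calls_after_action)
```

Each action in this toy re-runs the full plan from the source, so calling `plan.count()` afterwards would invoke `traced_double` five more times. Real Spark also recomputes lineage by default unless a dataset is cached, but its scheduling and distribution are far more involved than this sketch.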
Practice in Jupyter Notebook
In the GitHub repository, you will find all the code used in the course, allowing you to start working immediately.
You will work with five notebooks where you will learn to:
- apply transformations to data,
- work with schemas, columns, and data types,
- process JSON and CSV in DataFrames,
- merge and modify DataFrames,
- use Spark SQL to work with data,
- use RDDs to process unstructured data.
Lessons in Learning Apache Spark
# | Title | Duration |
---|---|---|
1 | Introduction & Contents | 03:31 |
2 | Why Spark - Vertical vs Horizontal Scaling | 03:56 |
3 | What Spark Is Good For | 04:46 |
4 | Spark Driver, Context & Executors | 04:12 |
5 | Cluster Types | 02:00 |
6 | Client vs Cluster Deployment | 06:12 |
7 | Where to Run Spark | 03:39 |
8 | Tools in the Spark Course | 02:36 |
9 | The Dataset | 04:13 |
10 | Docker Setup | 02:53 |
11 | Jupyter Notebook Setup & Run | 05:32 |
12 | RDDs | 03:58 |
13 | DataFrames | 01:41 |
14 | Transformations & Actions Overview | 03:00 |
15 | Transformations | 02:23 |
16 | Actions | 03:07 |
17 | Notebook 1: JSON Transformations | 09:53 |
18 | Notebook 2: Working with Schemas | 08:24 |
19 | Notebook 3: Working with DataFrames | 10:10 |
20 | Notebook 4: SparkSQL | 05:05 |
21 | Notebook 5: Working with RDDs | 12:53 |