Learning Apache Spark
In this course, you will learn about Spark architecture and the basic principles of its operation. You will practice transformations and actions in Spark, working in Jupyter Notebooks within a Docker environment. You will also be introduced to DataFrames, Spark SQL, and RDDs and learn how to use them for processing structured and unstructured data. Upon completion of the course, you will be ready to write your own Spark jobs and build pipelines on top of Spark.
Basics of Spark
You will understand why using Spark is beneficial and the difference between vertical and horizontal scaling. You will learn about the types of data Spark can work with and where it can be deployed.
You will delve into the architecture: executor, driver, context, cluster manager, and the types of clusters. You will also study the differences between client and cluster deployment modes.
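To give a feel for how these pieces connect, here is a minimal PySpark sketch (not from the course): the driver process creates a SparkSession and, through it, a SparkContext. The `local[*]` master and the app name are assumptions for running everything in a single local process instead of going through a real cluster manager.

```python
from pyspark.sql import SparkSession

# The driver creates the SparkSession (and, under it, the SparkContext),
# which talks to the cluster manager to request executors.
# "local[*]" is an assumption: driver and executors run in one local process,
# using all available cores; on a real cluster you would point the master
# at YARN, Kubernetes, or a standalone cluster manager instead.
spark = (
    SparkSession.builder
    .appName("spark-architecture-demo")   # hypothetical app name
    .master("local[*]")
    .getOrCreate()
)

sc = spark.sparkContext                   # the SparkContext held by the driver
print(sc.master, sc.applicationId)

spark.stop()
```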
Data and Development Environment
You will get acquainted with the tools we will use in the course. You will find out which dataset is used and how to set up the working environment: installing Docker and Jupyter Notebook.
Spark Coding Basics
Before the practice sessions, you will cover the key concepts: what RDDs and DataFrames are, and how each behaves when working with different types of data.
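As a rough illustration (not course code), the sketch below builds an RDD and a DataFrame from the same small, made-up records; the names and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-dataframe").getOrCreate()

# An RDD is a low-level distributed collection of arbitrary Python objects:
# Spark knows nothing about their internal structure.
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])

# A DataFrame adds named, typed columns on top of the same data,
# so Spark can optimize queries against the schema.
df = spark.createDataFrame(rdd, schema=["name", "age"])
df.printSchema()
df.show()

spark.stop()
```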
You will understand the difference between transformations and actions, how they interact with data and each other, and learn about the most commonly used types of operations.
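A minimal sketch of that distinction, assuming a local PySpark session: `filter` and `map` are lazy transformations that only build up an execution plan, while `collect` and `count` are actions that actually trigger the job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))

# Transformations are lazy: filter/map only describe the computation, nothing runs yet.
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger execution and pull results back to the driver.
print(evens_squared.collect())   # [0, 4, 16, 36, 64]
print(evens_squared.count())     # 5

spark.stop()
```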
Practice in Jupyter Notebook
In the GitHub repository, you will find all the code used in the course, allowing you to start working immediately.
You will work with five notebooks (a short code sketch follows this list) where you will learn to:
- apply transformations to data,
- work with schemas, columns, and data types,
- process JSON and CSV in DataFrames,
- merge and modify DataFrames,
- use Spark SQL to work with data,
- apply RDD for unstructured data.
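As a rough preview of the kind of code those notebooks involve (not the course's actual notebooks), the sketch below reads a JSON and a CSV file, joins DataFrames, runs a Spark SQL query over a temporary view, and finishes with an RDD word count over raw text. All file paths and column names are hypothetical placeholders, not the course dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("notebook-style-demo").getOrCreate()

# Hypothetical input files, stand-ins for the course dataset.
orders = spark.read.json("orders.json")                        # JSON into a DataFrame
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)

orders.printSchema()                                           # inspect schema, columns, types

# DataFrame transformations: derive a column, filter, and join.
big_orders = (
    orders
    .withColumn("total", F.col("quantity") * F.col("price"))
    .filter(F.col("total") > 100)
    .join(customers, on="customer_id", how="inner")
)

# Spark SQL over the same data via a temporary view.
big_orders.createOrReplaceTempView("big_orders")
spark.sql(
    "SELECT customer_id, SUM(total) AS revenue FROM big_orders GROUP BY customer_id"
).show()

# RDDs for unstructured data: a classic word count over raw text lines.
lines = spark.sparkContext.textFile("server.log")              # hypothetical log file
word_counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
print(word_counts.take(5))

spark.stop()
```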
Watch Online: Learning Apache Spark
# | Title | Duration |
---|---|---|
1 | Introduction & Contents | 03:31 |
2 | Why Spark - Vertical vs Horizontal Scaling | 03:56 |
3 | What Spark Is Good For | 04:46 |
4 | Spark Driver, Context & Executors | 04:12 |
5 | Cluster Types | 02:00 |
6 | Client vs Cluster Deployment | 06:12 |
7 | Where to Run Spark | 03:39 |
8 | Tools in the Spark Course | 02:36 |
9 | The Dataset | 04:13 |
10 | Docker Setup | 02:53 |
11 | Jupyter Notebook Setup & Run | 05:32 |
12 | RDDs | 03:58 |
13 | DataFrames | 01:41 |
14 | Transformations & Actions Overview | 03:00 |
15 | Transformations | 02:23 |
16 | Actions | 03:07 |
17 | Notebook 1: JSON Transformations | 09:53 |
18 | Notebook 2: Working with Schemas | 08:24 |
19 | Notebook 3: Working With DataFrames | 10:10 |
20 | Notebook 4: SparkSQL | 05:05 |
21 | Notebook 5: Working with RDDs | 12:53 |
Similar courses to Learning Apache Spark
- Case Study in Causal Analysis (LunarTech)
- Complete linear algebra: theory and implementation (udemy)
- Dockerized ETL With AWS, TDengine & Grafana (Andreas Kretz)
- Business Intelligence with Excel (zerotomastery.io)
- Data Analysis with Pandas and Python (udemy)
- SQL & Database Design A-Z™: Learn MS SQL Server + PostgreSQL (udemy)
- Data Analysis for Beginners: Python & Statistics (zerotomastery.io)
- The Data Bootcamp: Transform your Data using dbt™ (udemy)
- Relational Data Modeling (Eka Ponkratova)