Learning Apache Spark

1h 44m 4s
English
Paid
Next to building data pipelines, data processing is one of the most important tasks in data engineering. As a data engineer, you constantly need to process data, so it is crucial to be able to configure a powerful, distributed processing system. One of the most useful and widely used tools for this is Apache Spark.
In this course, you will learn about Spark's architecture and the basic principles of its operation. You will practice transformations and actions in Spark, working in Jupyter Notebooks within a Docker environment. You will also be introduced to DataFrames, Spark SQL, and RDDs, and learn how to use them to process structured and unstructured data. Upon completion of the course, you will be ready to write your own Spark jobs and build pipelines based on them.

Basics of Spark

You will understand why using Spark is beneficial and how vertical scaling differs from horizontal scaling. You will also learn about the types of data Spark can work with and where it can be deployed.

You will delve into the architecture: executor, driver, context, cluster manager, and the types of clusters. You will also study the differences between client and cluster deployment modes.

Data and Development Environment

You will get acquainted with the tools we will use in the course. You will find out which dataset is chosen and how to set up the working environment: install Docker and Jupyter Notebook.

Spark Coding Basics

Before the hands-on practice, you will cover the key concepts: what RDDs and DataFrames are, and how each behaves when working with different types of data.

You will understand the difference between transformations and actions, how they interact with data and each other, and learn about the most commonly used types of operations.
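To make the distinction concrete before the notebooks, here is a minimal sketch in plain Python. It is not Spark code and does not use the Spark API: the class name LazyDataset and its methods are hypothetical, chosen only to mirror the general idea that transformations record a plan while actions execute it. Treat it as an illustration under those assumptions, not as how Spark is implemented.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


class LazyDataset:
    """Hypothetical toy stand-in for a Spark dataset (not the Spark API).

    Transformations only record a step in a plan; actions run the plan.
    """

    def __init__(self, source: Iterable[int], plan: Tuple = ()):
        self._source = source
        self._plan = plan  # recorded steps, not yet executed

    # Transformations: return a new dataset and do no work.
    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def _run(self) -> Iterator[int]:
        # Chain the recorded steps over the source data.
        rows: Iterator[int] = iter(self._source)
        for kind, fn in self._plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    # Actions: execute the recorded plan and return a result.
    def collect(self) -> List[int]:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


calls: List[int] = []  # records every time the mapped function really runs


def double(x: int) -> int:
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has executed yet
result = ds.collect()             # the action triggers the whole plan
```

In this sketch, building ds runs none of the user functions; only collect() or count() does. The course notebooks show the real equivalents in Spark itself.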

Practice in Jupyter Notebook

In the GitHub repository, you will find all the code used in the course, allowing you to start working immediately.

You will work with five notebooks where you will learn to:

  • apply transformations to data,
  • work with schemas, columns, and data types,
  • process JSON and CSV in DataFrames,
  • merge and modify DataFrames,
  • use Spark SQL to work with data,
  • use RDDs to process unstructured data.

# Title Duration
1 Introduction & Contents 03:31
2 Why Spark - Vertical vs Horizontal Scaling 03:56
3 What Spark Is Good For 04:46
4 Spark Driver, Context & Executors 04:12
5 Cluster Types 02:00
6 Client vs Cluster Deployment 06:12
7 Where to Run Spark 03:39
8 Tools in the Spark Course 02:36
9 The Dataset 04:13
10 Docker Setup 02:53
11 Jupyter Notebook Setup & Run 05:32
12 RDDs 03:58
13 DataFrames 01:41
14 Transformations & Actions Overview 03:00
15 Transformations 02:23
16 Actions 03:07
17 Notebook 1: JSON Transformations 09:53
18 Notebook 2: Working with Schemas 08:24
19 Notebook 3: Working With DataFrames 10:10
20 Notebook 4: SparkSQL 05:05
21 Notebook 5: Working with RDDs 12:53
