Learning Apache Spark

1h 44m 4s
English
Paid

Master Apache Spark with this in-depth course designed for data engineers who want to strengthen their data processing skills. Apache Spark is a widely used engine for distributed data processing, and proficiency in it helps you build faster, more scalable data pipelines.

Course Overview

In this course, you will explore the architecture and fundamental principles of Apache Spark. You will work hands-on with Spark transformations and actions, using Jupyter Notebooks in a Docker environment. The course covers DataFrames, Spark SQL, and RDDs, giving you the skills to process both structured and unstructured data. By the end, you will be able to write your own Spark jobs and build data pipelines.

Basics of Spark

This section covers the benefits of Spark, starting with the distinction between vertical and horizontal scaling. You will learn which kinds of data Spark can handle and where it can be deployed. A deep dive into the architecture covers the driver, the executors, the Spark context, and cluster managers, then moves on to the different cluster types and the difference between client and cluster deployment modes.
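
To make the driver and executor roles concrete, here is a minimal pure-Python sketch. It is an analogy, not Spark code: the hypothetical `run_job` function plays the driver, splits the data into partitions, and hands each partition to a worker thread that stands in for an executor. Real Spark distributes partitions across machines through a cluster manager; threads are used here only so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """Work done by one 'executor' on its own partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors):
    """The 'driver': plan the work, distribute it, combine partial results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return sum(partial_results)


numbers = list(range(1, 11))
total = run_job(numbers, num_executors=3)
print(total)
```

In this picture, horizontal scaling corresponds to raising `num_executors` (more workers sharing the partitions), while vertical scaling would mean giving a single worker more CPU or memory.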

Data and Development Environment

Here, you will be introduced to the tools and setup required for this course. You will discover the chosen dataset and learn to install and configure your development environment, including Docker and Jupyter Notebook.

Spark Coding Basics

Before diving into practice, you will learn the core abstractions, RDDs and DataFrames: their properties and which kinds of data each one suits. You will also learn how transformations differ from actions, how each interacts with the data, and which operations are most commonly used.
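
The difference between transformations and actions comes down to lazy evaluation: transformations only describe work, and an action triggers it. The following is a toy pure-Python model of that idea, not Spark's API; the `LazyDataset` class and its method names are assumptions chosen to resemble the operations Spark exposes.

```python
class LazyDataset:
    """Toy stand-in for an RDD/DataFrame: records steps, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    # Actions: trigger execution of every recorded step.
    def collect(self):
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []


def double(x):
    calls.append(x)  # record when the function really runs
    return x * 2


pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has executed yet
result = pipeline.collect()       # the action runs the whole chain
print(calls_before_action, result)
```

Building `pipeline` never calls `double`; only `collect()` does. Spark relies on the same deferral to plan a whole chain of transformations before running it.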

Practical Exercises in Jupyter Notebook

A GitHub repository provides all of the course's code, so you can start the practical exercises right away.

You will work with five dedicated notebooks, acquiring the skills to:

  • Apply transformations to datasets,
  • Work with schemas, columns, and data types,
  • Process JSON and CSV files using DataFrames,
  • Merge and modify DataFrames effectively,
  • Utilize Spark SQL for enhanced data operations,
  • Apply RDD for handling unstructured data scenarios.
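
As a rough illustration of a few of these skills, here is a minimal PySpark sketch. It is not taken from the course notebooks: the sample records, the field names (`id`, `city`, `amount`), and the helper functions are assumptions made up for this example. The Spark part needs `pyspark` and a Java runtime, so it is wrapped in a function and imported lazily; the rest uses only the standard library.

```python
import json
import os
import tempfile

# Hypothetical sample records; the course's real dataset differs.
SAMPLE_ROWS = [
    {"id": 1, "city": "Berlin", "amount": 12.5},
    {"id": 2, "city": "Munich", "amount": 7.0},
    {"id": 3, "city": "Berlin", "amount": 3.5},
]


def write_json_lines(rows, directory):
    """Write one JSON object per line, the layout spark.read.json reads by default."""
    path = os.path.join(directory, "orders.json")
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return path


def run_spark_demo(path):
    """Requires pyspark and a Java runtime, hence the lazy imports."""
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import (
        DoubleType, LongType, StringType, StructField, StructType,
    )

    spark = SparkSession.builder.master("local[*]").appName("notebook-sketch").getOrCreate()
    try:
        schema = StructType([
            StructField("id", LongType()),
            StructField("city", StringType()),
            StructField("amount", DoubleType()),
        ])
        # Read JSON with an explicit schema, then add a derived column.
        df = spark.read.schema(schema).json(path)
        df = df.withColumn("amount_cents", F.col("amount") * 100)

        # Register the DataFrame as a view and query it with Spark SQL.
        df.createOrReplaceTempView("orders")
        totals = spark.sql(
            "SELECT city, SUM(amount) AS total FROM orders GROUP BY city ORDER BY city"
        )
        sql_rows = [(row["city"], row["total"]) for row in totals.collect()]

        # Use the RDD API for free-form, line-oriented text.
        word_counts = (
            spark.sparkContext.parallelize(["spark is fast", "spark is lazy"])
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
            .collectAsMap()
        )
        return sql_rows, word_counts
    finally:
        spark.stop()


with tempfile.TemporaryDirectory() as tmp:
    json_path = write_json_lines(SAMPLE_ROWS, tmp)
    with open(json_path, encoding="utf-8") as handle:
        written_lines = handle.read().splitlines()
    # Uncomment where PySpark is installed:
    # print(run_spark_demo(json_path))
```

The `local[*]` master runs Spark inside a single process using all local cores, which is a common choice for notebook experiments; the course's Docker setup may configure this differently.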

Additional

https://github.com/team-data-science/learning-apache-spark/tree/main/data

https://github.com/team-data-science/learning-apache-spark/tree/main/sources

https://github.com/team-data-science/learning-apache-spark

About the Author: Andreas Kretz

Andreas Kretz is a German data engineer and one of the most widely followed independent voices on data engineering as a career discipline. He runs the Plumbers of Data Science brand and has been publishing tutorial material continuously since the field consolidated around the modern lake-house stack (Spark, Kafka, Snowflake, Databricks, Airflow).

His CourseFlix listing is the largest single-author catalog under this source — over thirty courses spanning data-pipeline construction, streaming architectures, the cloud-native data stack on AWS / Azure / GCP, the Python and Scala tooling that dominates the field, and the soft-skills / career side of breaking into data engineering. Material is paid and aimed at engineers transitioning into data work or already-working data engineers picking up specific tools.

All Course Lessons (21)
#   Lesson Title                                Duration  Access
1   Introduction & Contents                     03:31     Demo
2   Why Spark - Vertical vs Horizontal Scaling  03:56
3   What Spark Is Good For                      04:46
4   Spark Driver, Context & Executors           04:12
5   Cluster Types                               02:00
6   Client vs Cluster Deployment                06:12
7   Where to Run Spark                          03:39
8   Tools in the Spark Course                   02:36
9   The Dataset                                 04:13
10  Docker Setup                                02:53
11  Jupyter Notebook Setup & Run                05:32
12  RDDs                                        03:58
13  DataFrames                                  01:41
14  Transformations & Actions Overview          03:00
15  Transformations                             02:23
16  Actions                                     03:07
17  Notebook 1: JSON Transformations            09:53
18  Notebook 2: Working with Schemas            08:24
19  Notebook 3: Working With DataFrames         10:10
20  Notebook 4: SparkSQL                        05:05
21  Notebook 5: Working with RDDs               12:53

Frequently asked questions

What are the prerequisites for enrolling in this course?
To enroll in this course, you should have a basic understanding of data engineering concepts and some experience with Python programming. Familiarity with Docker and Jupyter Notebooks is beneficial, as these tools are used extensively for hands-on activities within the course.
What projects or practical exercises are included in the course?
The course includes five hands-on Jupyter Notebooks. You will work on JSON transformations, handle schemas, manipulate DataFrames, use Spark SQL, and process unstructured data with RDDs. These activities are designed to help you build and execute Spark jobs and data pipelines.
Who is the target audience for this course?
This course is aimed at data engineers who want to enhance their data processing capabilities. It is ideal for professionals looking to master Apache Spark to streamline workflows and optimize data pipeline efficiency.
How does the depth and scope of this course compare to other Apache Spark courses?
The course provides an in-depth exploration of Apache Spark, focusing on both theoretical concepts and practical applications. It covers Spark's architecture, deployment environments, and data processing techniques, which may offer a more comprehensive foundation compared to courses that only focus on basic usage or specific features of Spark.
What specific tools or platforms will I learn to use in this course?
You will learn to use Apache Spark in conjunction with Docker and Jupyter Notebooks. The course guides you through the setup and configuration of these tools, which are essential for the hands-on activities and exercises involving Spark jobs.
What topics are not covered in this course?
The course does not cover advanced machine learning techniques or integrations with external data sources beyond the provided dataset. It focuses primarily on data processing capabilities using Spark, rather than on data science or analytics applications.
How much time should I expect to commit to complete this course?
The course consists of 21 lessons with a total runtime of 1h 44m 4s. Allow additional time for the hands-on notebook exercises; the overall commitment will vary with your pace and familiarity with the concepts.