
Learning Apache Spark

1h 44m 4s
English
Paid

Master Apache Spark with this in-depth course designed for data engineers seeking to enhance their data processing capabilities. Apache Spark is an essential tool for distributed data processing, and proficiency in it can significantly streamline your workflow and improve the efficiency of your data pipelines.

Course Overview

In this course, you will explore the architecture and fundamental principles of Apache Spark. You will engage in hands-on activities with transformations and actions in Spark, using Jupyter Notebooks in a Docker environment. The course covers DataFrames, Spark SQL, and RDDs to equip you with skills for processing both structured and unstructured data. By the end, you will be competent in crafting your own Spark jobs and building data pipelines.

Basics of Spark

This section covers the benefits of Spark, emphasizing the distinction between vertical and horizontal scaling. You will learn about the various data types Spark can handle and the environments it can be deployed to. A deep dive into the architecture will cover components such as executors, the driver, the Spark context, and cluster managers, along with the different cluster types and the nuances between client and cluster deployment modes.
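As a hedged illustration of the client vs cluster distinction, the deployment mode is typically selected with the `--deploy-mode` flag of `spark-submit`; the master URL and application file below are placeholders, not taken from the course:

```shell
# Client mode: the driver runs on the machine that invoked spark-submit,
# so you see the application's output locally (handy while developing).
spark-submit --master yarn --deploy-mode client my_job.py

# Cluster mode: the driver itself is launched inside the cluster,
# which is the usual choice for production jobs.
spark-submit --master yarn --deploy-mode cluster my_job.py
```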

Data and Development Environment

Here, you will be introduced to the tools and setup required for this course. You will discover the chosen dataset and learn to install and configure your development environment, including Docker and Jupyter Notebook.
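One common way to get such an environment running, sketched here as an assumption since the course demonstrates its own setup in the lessons, is the community `jupyter/pyspark-notebook` Docker image, which bundles Jupyter and PySpark:

```shell
# Start a Jupyter server with PySpark preinstalled; the notebook UI
# becomes reachable at http://localhost:8888 (token printed in the logs).
docker run -p 8888:8888 jupyter/pyspark-notebook
```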

Spark Coding Basics

Before diving into practice, you will gain an understanding of core concepts such as RDDs and DataFrames, identifying their properties and the data types they suit. You will grasp the distinction between transformations and actions, how each interacts with data, and become familiar with commonly used operations.

Practical Exercises in Jupyter Notebook

A comprehensive GitHub repository will provide you with all the course's code, facilitating an easy start to your practical journey.

You will work with five dedicated notebooks, acquiring the skills to:

  • Apply transformations to datasets,
  • Work with schemas, columns, and data types,
  • Process JSON and CSV files using DataFrames,
  • Merge and modify DataFrames effectively,
  • Use Spark SQL for enhanced data operations,
  • Apply RDDs to unstructured data scenarios.

About the Author: Andreas Kretz


I am a senior data engineer, trainer, tech enthusiast, and father. For more than ten years, I have been passionate about data engineering. I started as a self-taught data engineer and went on to lead a team of data engineers at a large company. When I saw the great demand for education in this field, I followed my passion and founded my own Data Engineering Academy. Since then, I have helped over 2,000 students achieve their goals.

All Course Lessons (21)
1. Introduction & Contents (03:31, free demo)
2. Why Spark - Vertical vs Horizontal Scaling (03:56)
3. What Spark Is Good For (04:46)
4. Spark Driver, Context & Executors (04:12)
5. Cluster Types (02:00)
6. Client vs Cluster Deployment (06:12)
7. Where to Run Spark (03:39)
8. Tools in the Spark Course (02:36)
9. The Dataset (04:13)
10. Docker Setup (02:53)
11. Jupyter Notebook Setup & Run (05:32)
12. RDDs (03:58)
13. DataFrames (01:41)
14. Transformations & Actions Overview (03:00)
15. Transformations (02:23)
16. Actions (03:07)
17. Notebook 1: JSON Transformations (09:53)
18. Notebook 2: Working with Schemas (08:24)
19. Notebook 3: Working With DataFrames (10:10)
20. Notebook 4: SparkSQL (05:05)
21. Notebook 5: Working with RDDs (12:53)