CourseFlix

Learning Apache Spark

1h 44m 4s
English
Paid

Master Apache Spark with this in-depth course designed for data engineers seeking to enhance their data processing capabilities. Apache Spark is an essential tool for large-scale data processing, and proficiency in it can significantly streamline your workflow and improve data pipeline efficiency.

Course Overview

In this course, you will explore the architecture and fundamental principles of Apache Spark. You will engage in hands-on activities with transformations and actions in Spark, using Jupyter Notebooks in a Docker environment. The course covers DataFrames, Spark SQL, and RDDs to equip you with skills for processing both structured and unstructured data. By the end, you will be able to write your own Spark jobs and build data pipelines.

Basics of Spark

This section will cover the benefits of Spark, emphasizing the distinction between vertical and horizontal scaling. You will learn about the various data types Spark can handle and potential deployment environments. A deep dive into the architecture will reveal components like executors, drivers, context, and cluster managers, along with the different cluster types and the differences between client and cluster deployment modes.

Data and Development Environment

Here, you will be introduced to the tools and setup required for this course. You will discover the chosen dataset and learn to install and configure your development environment, including Docker and Jupyter Notebook.

Spark Coding Basics

Before diving into practice, you will gain an understanding of core concepts such as RDD and DataFrame, identifying their properties and applications with various data types. You will grasp the distinction between transformations and actions, their interaction with data, and become familiar with commonly used operations.

Practical Exercises in Jupyter Notebook

A comprehensive GitHub repository will provide you with all the course's code, facilitating an easy start to your practical journey.

You will work with five dedicated notebooks, acquiring the skills to:

  • Apply transformations to datasets,
  • Work with schemas, columns, and data types,
  • Process JSON and CSV files using DataFrames,
  • Merge and modify DataFrames effectively,
  • Utilize Spark SQL for enhanced data operations,
  • Apply RDD for handling unstructured data scenarios.

Additional Resources

https://github.com/team-data-science/learning-apache-spark/tree/main/data

https://github.com/team-data-science/learning-apache-spark/tree/main/sources

https://github.com/team-data-science/learning-apache-spark

About the Author: Andreas Kretz

Andreas Kretz is a German data engineer and one of the most widely followed independent voices on data engineering as a career discipline. He runs the Plumbers of Data Science brand and has been publishing tutorial material continuously since the field consolidated around the modern lake-house stack (Spark, Kafka, Snowflake, Databricks, Airflow).

His CourseFlix listing is the largest single-author catalog under this source — over thirty courses spanning data-pipeline construction, streaming architectures, the cloud-native data stack on AWS / Azure / GCP, the Python and Scala tooling that dominates the field, and the soft-skills / career side of breaking into data engineering. Material is paid and aimed at engineers transitioning into data work or already-working data engineers picking up specific tools.

Watch Online (21 lessons)

All Course Lessons (21)
1. Introduction & Contents (03:31) - free demo
2. Why Spark - Vertical vs Horizontal Scaling (03:56)
3. What Spark Is Good For (04:46)
4. Spark Driver, Context & Executors (04:12)
5. Cluster Types (02:00)
6. Client vs Cluster Deployment (06:12)
7. Where to Run Spark (03:39)
8. Tools in the Spark Course (02:36)
9. The Dataset (04:13)
10. Docker Setup (02:53)
11. Jupyter Notebook Setup & Run (05:32)
12. RDDs (03:58)
13. DataFrames (01:41)
14. Transformations & Actions Overview (03:00)
15. Transformations (02:23)
16. Actions (03:07)
17. Notebook 1: JSON Transformations (09:53)
18. Notebook 2: Working with Schemas (08:24)
19. Notebook 3: Working With DataFrames (10:10)
20. Notebook 4: SparkSQL (05:05)
21. Notebook 5: Working with RDDs (12:53)

Related courses

  • Data Engineering on Databricks

    Sources: Andreas Kretz
    Learn Databricks for data processing using Apache Spark. This course covers setup on AWS, ETL processes, data visualization, and BI tools integration.
    1h 27m 29s · Rating: 5 / 5

  • TensorFlow Developer Certificate in 2023: Zero to Mastery

    Sources: zerotomastery.io
    Learn TensorFlow. Pass the TensorFlow Developer Certificate Exam. Get Hired as a TensorFlow developer. This course will take you from a TensorFlow beginner to b
    62h 43m 54s

  • 2022 Python for Machine Learning & Data Science Masterclass

    Sources: Udemy
    Welcome to the most complete course on learning Data Science and Machine Learning on the internet! After teaching over 2 million students I've worked for over a
    44h 5m 31s