
Learning Apache Spark

1h 44m 4s
English
Paid

Course description

Alongside building data pipelines, data processing is one of the most important tasks in data engineering. As a data engineer you constantly need to process data, so it is crucial to be able to configure a powerful, distributed processing system. One of the most useful and widely used tools for this is Apache Spark.

In this course, you will learn about Spark's architecture and the basic principles of its operation. You will practice transformations and actions in Spark, working in Jupyter Notebooks inside a Docker environment. You will also be introduced to DataFrames, Spark SQL, and RDDs and learn how to use them for processing structured and unstructured data. Upon completing the course, you will be ready to write your own Spark jobs and build pipelines on top of them.

Basics of Spark

You will understand why using Spark is beneficial and the difference between vertical and horizontal scaling, and you will learn about the types of data Spark can work with and where it can be deployed.

You will delve into the architecture: executor, driver, context, cluster manager, and the types of clusters. You will also study the differences between client and cluster deployment modes.

Data and Development Environment

You will get acquainted with the tools we will use in the course. You will find out which dataset is chosen and how to set up the working environment: install Docker and Jupyter Notebook.
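As a rough sketch of that setup, a single Docker command can start Jupyter with PySpark preinstalled. The image name below is the public community image, which may differ from the one the course uses; check the course repository for the exact command:

```shell
# Run Jupyter with PySpark bundled (community image, assumed here).
# Port 8888 is Jupyter's default web UI port.
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
# Jupyter then prints a http://127.0.0.1:8888/?token=... URL to open.
```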

Spark Coding Basics

Before the hands-on work, you will cover the key concepts: what RDDs and DataFrames are, and how each behaves when working with different types of data.

You will understand the difference between transformations and actions, how they interact with data and each other, and learn about the most commonly used types of operations.

Practice in Jupyter Notebook

In the GitHub repository, you will find all the code used in the course, allowing you to start working immediately.

You will work with five notebooks where you will learn to:

  • apply transformations to data,
  • work with schemas, columns, and data types,
  • process JSON and CSV in DataFrames,
  • merge and modify DataFrames,
  • use Spark SQL to work with data,
  • apply RDD for unstructured data.


All Course Lessons (21)

 1. Introduction & Contents (Demo) - 03:31
 2. Why Spark - Vertical vs Horizontal Scaling - 03:56
 3. What Spark Is Good For - 04:46
 4. Spark Driver, Context & Executors - 04:12
 5. Cluster Types - 02:00
 6. Client vs Cluster Deployment - 06:12
 7. Where to Run Spark - 03:39
 8. Tools in the Spark Course - 02:36
 9. The Dataset - 04:13
10. Docker Setup - 02:53
11. Jupyter Notebook Setup & Run - 05:32
12. RDDs - 03:58
13. DataFrames - 01:41
14. Transformations & Actions Overview - 03:00
15. Transformations - 02:23
16. Actions - 03:07
17. Notebook 1: JSON Transformations - 09:53
18. Notebook 2: Working with Schemas - 08:24
19. Notebook 3: Working With DataFrames - 10:10
20. Notebook 4: SparkSQL - 05:05
21. Notebook 5: Working with RDDs - 12:53



Similar courses

Storing & Visualizing Time Series Data
Source: Andreas Kretz
Processing, storing, and visualizing time series data is becoming an increasingly important task. From IoT data and system logs to statistics...
2 hours 11 minutes 34 seconds

Machine Learning: Natural Language Processing in Python (V2)
Source: udemy
Welcome to Machine Learning: Natural Language Processing in Python (Version 2). NLP: Use Markov Models, NLTK, Artificial Intelligence, Deep Learning, Machine Le
22 hours 4 minutes 2 seconds

Apache Airflow Workflow Orchestration
Source: Andreas Kretz
Apache Airflow is a platform-independent tool for workflow orchestration that provides extensive capabilities for creating and...
1 hour 18 minutes 41 seconds

Case Study in Product Data Science
Source: LunarTech
This is a course that offers unique opportunities for students seeking to master key aspects of data analysis in product development. The course...
1 hour 4 minutes 47 seconds

Business Intelligence with Excel
Source: zerotomastery.io
The only course you need to launch your career as a Data Professional! Learn to master Excel's built-in power tools, including Power Query, Power Pivot Tables,
7 hours 41 minutes 24 seconds