The Data Engineering Bootcamp: Zero to Mastery

13h 23m 15s
English
Paid

Learn how to build streaming pipelines with Apache Kafka and Flink, create data lakes on AWS, run ML workflows on Spark, and integrate large language models (LLMs) into production systems. This course is designed to kickstart your career and make you a sought-after data engineer.


Why has Data Engineering become a major profession in IT?

Data Engineering is rapidly becoming one of the fastest-growing and most in-demand professions in tech. With the rise of AI products, analytical systems, and real-time applications, companies are actively building out their data infrastructure, driving strong demand for specialists.

Just last year, more than 20,000 new data engineer positions were created, and the total number of open positions in North America approached 150,000, clearly demonstrating the explosive growth of the industry.

Moreover, the salaries are impressive:

  • Entry level: $80,000–$110,000 per year
  • Mid and senior level: up to $190,000–$200,000+

Furthermore, data engineers play a strategic role: they build the foundation for machine learning systems, analytics, and AI, without which modern tech products are impossible. With the further growth of AI, the demand for data engineers will only increase, creating excellent opportunities for a long-term career and financial stability.

Why this particular bootcamp?

Our bootcamp is designed to be as comprehensive and practical as possible, without unnecessary theory or outdated tutorials. You will learn step by step and build real projects using the same tools that professionals use.

You will start with Apache Spark, processing real Airbnb data and mastering large-scale computations. Then you will create a modern data lake on AWS using S3, EMR, Glue, and Athena. You will learn pipeline orchestration with Apache Airflow, build streaming systems on Kafka and Flink, and even integrate machine learning and large language models (LLMs) directly into your pipelines.
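To give a flavor of the Spark material, here is a minimal sketch of the kind of group-and-average aggregation the course builds with Spark's DataFrame API (roughly `df.groupBy("neighbourhood").agg(avg("price"))`), written in plain Python for illustration; the column names and sample rows are hypothetical, not the course dataset:

```python
# Hypothetical sample of Airbnb-style listing records (not the course dataset).
listings = [
    {"neighbourhood": "Downtown", "price": 120.0},
    {"neighbourhood": "Downtown", "price": 180.0},
    {"neighbourhood": "Harbor",   "price": 90.0},
]

# Group-and-average: the same shape of aggregation Spark expresses as
# df.groupBy("neighbourhood").agg(avg("price")).
totals: dict[str, list[float]] = {}
for row in listings:
    totals.setdefault(row["neighbourhood"], []).append(row["price"])

avg_price = {hood: sum(prices) / len(prices) for hood, prices in totals.items()}
print(avg_price)  # {'Downtown': 150.0, 'Harbor': 90.0}
```

On a cluster, Spark distributes exactly this grouping across machines, which is what makes the same one-liner work on millions of listings.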

As a result, you will learn to build end-to-end production-level systems - the exact skills employers are looking for.
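The streaming half of those systems revolves around time windows. As a rough stdlib sketch of the tumbling-window idea (Flink offers this natively; the event timestamps and 10-second window size here are hypothetical):

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    # Assign an epoch-seconds timestamp to its fixed-size window [start, end).
    # This mirrors how tumbling event-time windows carve the stream into
    # non-overlapping buckets.
    start = ts - (ts % size)
    return (start, start + size)

# Hypothetical event timestamps (epoch seconds), grouped into 10-second windows.
events = [3, 7, 12, 19, 21]
windows: dict[tuple[int, int], int] = {}
for ts in events:
    w = tumbling_window(ts, 10)
    windows[w] = windows.get(w, 0) + 1

print(windows)  # {(0, 10): 2, (10, 20): 2, (20, 30): 1}
```

A real Flink job adds what this sketch omits: watermarks for late events and fault-tolerant state, both covered in the streaming modules.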

What's inside the course?

  • Introduction to Data Engineering
    • Understand how modern data engineering works and what you need to get started.
  • Big Data Processing with Apache Spark
    • Learn to work with large datasets using the DataFrame API, UDFs, aggregations, and optimization.
  • Building a Data Lake on AWS
    • Build scalable data storage using S3, EMR, and Athena.
  • Pipelines with Apache Airflow
    • Automate and manage tasks, handle errors, schedule and run Spark jobs.
  • ML with Spark MLlib
    • Embed machine learning in your pipelines - classification, regression, model selection.
  • AI and LLM in Data Engineering
    • Use Hugging Face and other tools to integrate LLMs into data processing.
  • Stream Processing with Apache Kafka and Flink
    • Create real-time systems, process events, and work with streaming data.
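One idea that recurs across the data-lake and streaming modules is partitioning. A minimal sketch of Hive-style partition paths, the `key=value` directory layout that S3-based data lakes use so Athena and Glue can prune partitions and scan only the days a query needs; the bucket and table names below are hypothetical:

```python
from datetime import date

def partition_key(event_day: date) -> str:
    # Hive-style key=value path segments; query engines use these to skip
    # whole directories instead of scanning the full dataset.
    return f"year={event_day.year}/month={event_day.month:02d}/day={event_day.day:02d}"

def s3_path(bucket: str, table: str, event_day: date) -> str:
    # Hypothetical bucket/table names, for illustration only.
    return f"s3://{bucket}/{table}/{partition_key(event_day)}/"

path = s3_path("my-data-lake", "listings", date(2024, 5, 3))
print(path)  # s3://my-data-lake/listings/year=2024/month=05/day=03/
```

Choosing a partition column that matches your most common query filter (usually date) is what keeps data-lake queries cheap as the lake grows.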

Outcome

After completing the course, you won't just have watched videos - you'll be a data engineer ready to build the systems companies need today.

Thousands of our graduates already work at Google, Tesla, Amazon, Apple, IBM, JP Morgan, Facebook, Shopify, and other top companies.

Many of them started from scratch. So why not become the next one?

Watch Online The Data Engineering Bootcamp: Zero to Mastery

# Title Duration
1 The Data Engineering Bootcamp: Zero to Mastery 01:35
2 Introduction to Data Engineering 04:17
3 Who Are Data Engineers? 04:43
4 Prerequisites 03:19
5 Source Code for This Bootcamp 01:19
6 Plan for This Bootcamp 04:38
7 [Optional] What Is a Virtualenv? 06:37
8 [Optional] What Is Docker? 11:03
9 Introduction 04:08
10 Apache Spark 03:44
11 How Spark Works 04:24
12 Spark Application 07:41
13 DataFrames 06:43
14 Installing Spark 05:51
15 Inside Airbnb Data 07:02
16 Writing Your First Spark Job 07:05
17 Lazy Processing 02:16
18 [Exercise] Basic Functions 01:29
19 [Exercise] Basic Functions - Solution 06:41
20 Aggregating Data 04:00
21 Joining Data 04:40
22 Aggregations and Joins with Spark 06:10
23 Complex Data Types 05:09
24 [Exercise] Aggregate Functions 00:50
25 [Exercise] Aggregate Functions - Solution 05:54
26 User Defined Functions 03:25
27 Data Shuffle 06:14
28 Data Accumulators 03:42
29 Optimizing Spark Jobs 07:39
30 Submitting Spark Jobs 04:29
31 Other Spark APIs 05:16
32 Spark SQL 04:33
33 [Exercise] Advanced Spark 02:10
34 [Exercise] Advanced Spark - Solution 05:26
35 Summary 03:08
36 Introduction 04:26
37 What Is a Data Lake? 09:08
38 Amazon Web Services (AWS) 07:47
39 Simple Storage Service (S3) 05:45
40 Setting Up an AWS Account 09:29
41 Data Partitioning 03:24
42 Using S3 07:49
43 EMR Serverless 02:59
44 IAM Roles 02:52
45 Running a Spark Job 08:49
46 Parquet Data Format 07:41
47 Implementing a Data Catalog 05:32
48 Data Catalog Demo 06:42
49 Querying a Data Lake 04:00
50 Summary 03:39
51 Introduction 05:53
52 What Is Apache Airflow? 05:19
53 Airflow’s Architecture 03:15
54 Installing Airflow 06:33
55 Defining an Airflow DAG 08:03
56 Error Handling 03:38
57 Idempotent Tasks 04:54
58 Creating a DAG - Part 1 04:58
59 Creating a DAG - Part 2 04:42
60 Handling Failed Tasks 04:09
61 [Exercise] Data Validation 04:31
62 [Exercise] Data Validation - Solution 03:27
63 Spark with Airflow 03:02
64 Using Spark with Airflow - Part 1 07:39
65 Using Spark with Airflow - Part 2 05:52
66 Sensors In Airflow 04:46
67 Using File Sensors 04:08
68 Data Ingestion 05:50
69 Reading Data from Postgres - Part 1 06:03
70 Reading Data from Postgres - Part 2 05:40
71 [Exercise] Average Customer Review 03:53
72 [Exercise] Average Customer Review - Solution 04:33
73 Advanced DAGs 04:26
74 Summary 02:27
75 Introduction 05:28
76 What Is Machine Learning 06:06
77 Regression Algorithms 05:38
78 Building a Regression Model 05:04
79 Training a Model 09:46
80 Model Evaluation 07:26
81 Testing a Regression Model 03:57
82 Model Lifecycle 02:12
83 Feature Engineering 08:44
84 Improving a Regression Model 07:34
85 Machine Learning Pipelines 03:56
86 Creating a Pipeline 02:41
87 [Exercise] House Price Estimation 01:59
88 [Exercise] House Price Estimation - Solution 03:12
89 [Exercise] Imposter Syndrome 02:57
90 Classification 07:37
91 Classifiers Evaluation 04:27
92 Training a Classifier 08:31
93 Hyperparameters 08:06
94 Optimizing a Model 03:02
95 [Exercise] Loan Approval 02:34
96 [Exercise] Loan Approval - Solution 02:33
97 Deep Learning 06:56
98 Summary 03:23
99 Introduction 05:07
100 Natural Language Processing (NLP) before LLMs 06:11
101 Transformers 06:21
102 Types of LLMs 07:40
103 Hugging Face 02:19
104 Databricks Set Up 10:38
105 Using an LLM 07:36
106 Structured Output 03:42
107 Producing JSON Output 05:10
108 LLMs With Apache Spark 05:20
109 Summary 02:48
110 Introduction 06:06
111 What Is Apache Kafka? 07:00
112 Partitioning Data 08:56
113 Kafka API 07:42
114 Kafka Architecture 03:15
115 Set Up Kafka 05:53
116 Writing to Kafka 06:07
117 Reading from Kafka 07:37
118 Data Durability 06:39
119 Kafka vs Queues 02:11
120 [Exercise] Processing Records 03:44
121 [Exercise] Processing Records - Solution 02:59
122 Delivery Semantics 05:53
123 Kafka Transactions 04:34
124 Log Compaction 03:23
125 Kafka Connect 06:59
126 Using Kafka Connect 09:44
127 Outbox Pattern 04:31
128 Schema Registry 08:01
129 Using Schema Registry 08:10
130 Tiered Storage 03:28
131 [Exercise] Track Order Status Changes 04:27
132 [Exercise] Track Order Status Changes - Solution 05:06
133 Summary 04:41
134 Introduction 05:40
135 What Is Apache Flink? 05:24
136 Kafka Application 08:11
137 Multiple Streams 03:11
138 Installing Apache Flink 05:46
139 Processing Individual Records 07:22
140 [Exercise] Stream Processing 04:02
141 [Exercise] Stream Processing - Solution 02:40
142 Time Windows 06:49
143 Keyed Windows 02:40
144 Using Time Windows 05:18
145 Watermarks 10:06
146 Advanced Window Operations 06:17
147 Stateful Stream Processing 07:50
148 Using Local State 04:42
149 [Exercise] Anomalies Detection 04:35
150 [Exercise] Anomalies Detection - Solution 03:34
151 Joining Streams 05:50
152 Summary 03:10
153 Thank You! 01:18

Similar courses to The Data Engineering Bootcamp: Zero to Mastery

Business Intelligence with Excel (zerotomastery.io)
Category: Data processing and analysis
Duration: 7 hours 41 minutes 24 seconds

PyTorch for Deep Learning with Python Bootcamp (udemy)
Category: Python, Data processing and analysis
Duration: 17 hours 2 minutes 14 seconds

Data Platform & Pipeline Design (Andreas Kretz)
Category: Data processing and analysis
Duration: 1 hour 59 minutes 5 seconds

Building APIs with FastAPI (Andreas Kretz)
Category: Python, Data processing and analysis
Duration: 1 hour 35 minutes 40 seconds

Complete linear algebra: theory and implementation (udemy)
Category: Python, Data processing and analysis
Duration: 32 hours 53 minutes 26 seconds

Data Analysis for Beginners: Python & Statistics (zerotomastery.io)
Category: Python, Data processing and analysis
Duration: 6 hours 34 minutes 20 seconds

Data Analysis for Beginners: Excel & Pivot Tables (zerotomastery.io)
Category: Data processing and analysis
Duration: 2 hours 10 minutes 21 seconds

Choosing Data Stores (Andreas Kretz)
Category: Data processing and analysis
Duration: 1 hour 25 minutes 31 seconds

Becoming a Better Data Engineer (Andreas Kretz)
Category: Data processing and analysis
Duration: 1 hour 46 minutes 10 seconds

Python for Data Science and Machine Learning Bootcamp (udemy)
Category: Python, Data processing and analysis
Duration: 24 hours 49 minutes 42 seconds