The Data Engineering Bootcamp: Zero to Mastery

13h 23m 15s
English
Paid

Learn how to build streaming pipelines with Apache Kafka and Flink, create data lakes on AWS, run ML workflows on Spark, and integrate LLMs into production systems. This course is designed to kickstart your career and make you the sought-after data engineer of tomorrow.

Why is Data Engineering the new major profession in IT?

Data engineering is one of the fastest-growing and most in-demand professions in the tech world. With the rise of AI products, analytical systems, and real-time applications, companies are actively building out their data infrastructure, which drives the demand for specialists.

Just last year, more than 20,000 new data engineer positions were created, and the total number of open positions in North America approached 150,000, clearly demonstrating the explosive growth of the industry.

Moreover, the salaries are impressive:

  • Entry level - $80,000 to $110,000 per year
  • Mid and senior level - $190,000 to $200,000+ per year

Furthermore, data engineers play a strategic role: they build the foundation for machine learning systems, analytics, and AI, without which modern tech products are impossible. With the further growth of AI, the demand for data engineers will only increase, creating excellent opportunities for a long-term career and financial stability.

Why this particular bootcamp?

Our bootcamp is designed to be as comprehensive and practical as possible, without unnecessary theory or outdated tutorials. You will learn step by step and build real projects using the same tools that professionals use.

You will start with Apache Spark, processing real Airbnb data and mastering large-scale computation. Then you will create a modern data lake on AWS using S3, EMR, Glue, and Athena. You will learn pipeline orchestration with Apache Airflow, build streaming systems on Kafka and Flink, and even integrate machine learning and large language models (LLMs) directly into your pipelines.
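
To give a flavor of the hands-on work, here is a minimal sketch of the kind of Spark job the first module builds toward. The file name and column names are hypothetical stand-ins for the Inside Airbnb dataset used in the lessons:

```python
# Minimal PySpark sketch: aggregate Airbnb-style listings data.
# "listings.csv", "price", and "neighbourhood" are hypothetical stand-ins.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("airbnb-aggregation").getOrCreate()

# Inside Airbnb CSVs contain quoted, multi-line text fields.
listings = spark.read.csv(
    "listings.csv", header=True, multiLine=True, escape='"', inferSchema=True
)

# Average price per neighbourhood - a typical DataFrame aggregation.
avg_price = (
    listings
    .withColumn("price", F.regexp_replace("price", "[$,]", "").cast("double"))
    .groupBy("neighbourhood")
    .agg(F.avg("price").alias("avg_price"))
    .orderBy(F.desc("avg_price"))
)

avg_price.show(10)
spark.stop()
```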

As a result, you will learn to build end-to-end production-level systems - the exact skills employers are looking for.

What's inside the course?

  • Introduction to Data Engineering
    • Understand how modern data engineering works and what you need to get started.
  • Big Data Processing with Apache Spark
    • Learn to work with large datasets using the DataFrame API, UDFs, aggregations, and job optimization.
  • Building a Data Lake on AWS
    • Build scalable data storage using S3, EMR, and Athena.
  • Pipelines with Apache Airflow
    • Automate and manage tasks, handle errors, and schedule and run Spark jobs (a minimal DAG sketch follows this list).
  • ML with Spark MLlib
    • Embed machine learning in your pipelines - classification, regression, and model selection.
  • AI and LLMs in Data Engineering
    • Use Hugging Face and other tools to integrate LLMs into data processing.
  • Stream Processing with Apache Kafka and Flink
    • Build real-time systems that ingest and process event streams as they arrive (see the Kafka sketch below).
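
As promised above, here is a minimal sketch of an Airflow DAG in the style the pipeline module teaches: a daily schedule with retries for basic error handling. The DAG name, task names, and callables are hypothetical:

```python
# Minimal Airflow DAG sketch: two dependent tasks with retries.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Hypothetical placeholder: pull the day's raw data.
    print("extracting...")


def validate():
    # Hypothetical placeholder: an idempotent validation step.
    print("validating...")


with DAG(
    dag_id="airbnb_daily_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)

    extract_task >> validate_task  # validate runs only after extract succeeds
```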
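
And a minimal sketch of producing and consuming events, using the kafka-python client and assuming a local broker on localhost:9092. The topic name and event shape are hypothetical, and the course itself may use different tooling:

```python
# Minimal Kafka sketch: produce and consume JSON events with kafka-python.
import json

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Keying by order ID sends all events for one order to the same partition,
# preserving per-order ordering.
producer.send("orders", key=b"order-42", value={"status": "created"})
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-status-tracker",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.key, message.value)  # process each event as it arrives
```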

Outcome

After completing the course, you won’t just have watched videos - you'll become a true data engineer, ready to build systems that companies need today.

Thousands of our graduates already work at Google, Tesla, Amazon, Apple, IBM, JP Morgan, Facebook, Shopify, and other top companies.

Many of them started from scratch. So why not become the next one?

Watch Online The Data Engineering Bootcamp: Zero to Mastery

# Title Duration
1 The Data Engineering Bootcamp: Zero to Mastery 01:35
2 Introduction to Data Engineering 04:17
3 Who Are Data Engineers? 04:43
4 Prerequisites 03:19
5 Source Code for This Bootcamp 01:19
6 Plan for This Bootcamp 04:38
7 [Optional] What Is a Virtualenv? 06:37
8 [Optional] What Is Docker? 11:03
9 Introduction 04:08
10 Apache Spark 03:44
11 How Spark Works 04:24
12 Spark Application 07:41
13 DataFrames 06:43
14 Installing Spark 05:51
15 Inside Airbnb Data 07:02
16 Writing Your First Spark Job 07:05
17 Lazy Processing 02:16
18 [Exercise] Basic Functions 01:29
19 [Exercise] Basic Functions - Solution 06:41
20 Aggregating Data 04:00
21 Joining Data 04:40
22 Aggregations and Joins with Spark 06:10
23 Complex Data Types 05:09
24 [Exercise] Aggregate Functions 00:50
25 [Exercise] Aggregate Functions - Solution 05:54
26 User Defined Functions 03:25
27 Data Shuffle 06:14
28 Data Accumulators 03:42
29 Optimizing Spark Jobs 07:39
30 Submitting Spark Jobs 04:29
31 Other Spark APIs 05:16
32 Spark SQL 04:33
33 [Exercise] Advanced Spark 02:10
34 [Exercise] Advanced Spark - Solution 05:26
35 Summary 03:08
36 Introduction 04:26
37 What Is a Data Lake? 09:08
38 Amazon Web Services (AWS) 07:47
39 Simple Storage Service (S3) 05:45
40 Setting Up an AWS Account 09:29
41 Data Partitioning 03:24
42 Using S3 07:49
43 EMR Serverless 02:59
44 IAM Roles 02:52
45 Running a Spark Job 08:49
46 Parquet Data Format 07:41
47 Implementing a Data Catalog 05:32
48 Data Catalog Demo 06:42
49 Querying a Data Lake 04:00
50 Summary 03:39
51 Introduction 05:53
52 What Is Apache Airflow? 05:19
53 Airflow’s Architecture 03:15
54 Installing Airflow 06:33
55 Defining an Airflow DAG 08:03
56 Error Handling 03:38
57 Idempotent Tasks 04:54
58 Creating a DAG - Part 1 04:58
59 Creating a DAG - Part 2 04:42
60 Handling Failed Tasks 04:09
61 [Exercise] Data Validation 04:31
62 [Exercise] Data Validation - Solution 03:27
63 Spark with Airflow 03:02
64 Using Spark with Airflow - Part 1 07:39
65 Using Spark with Airflow - Part 2 05:52
66 Sensors In Airflow 04:46
67 Using File Sensors 04:08
68 Data Ingestion 05:50
69 Reading Data From Postgres - Part 1 06:03
70 Reading Data from Postgres - Part 2 05:40
71 [Exercise] Average Customer Review 03:53
72 [Exercise] Average Customer Review - Solution 04:33
73 Advanced DAGs 04:26
74 Summary 02:27
75 Introduction 05:28
76 What Is Machine Learning? 06:06
77 Regression Algorithms 05:38
78 Building a Regression Model 05:04
79 Training a Model 09:46
80 Model Evaluation 07:26
81 Testing a Regression Model 03:57
82 Model Lifecycle 02:12
83 Feature Engineering 08:44
84 Improving a Regression Model 07:34
85 Machine Learning Pipelines 03:56
86 Creating a Pipeline 02:41
87 [Exercise] House Price Estimation 01:59
88 [Exercise] House Price Estimation - Solution 03:12
89 [Exercise] Imposter Syndrome 02:57
90 Classification 07:37
91 Classifiers Evaluation 04:27
92 Training a Classifier 08:31
93 Hyperparameters 08:06
94 Optimizing a Model 03:02
95 [Exercise] Loan Approval 02:34
96 [Exercise] Loan Approval - Solution 02:33
97 Deep Learning 06:56
98 Summary 03:23
99 Introduction 05:07
100 Natural Language Processing (NLP) before LLMs 06:11
101 Transformers 06:21
102 Types of LLMs 07:40
103 Hugging Face 02:19
104 Databricks Set Up 10:38
105 Using an LLM 07:36
106 Structured Output 03:42
107 Producing JSON Output 05:10
108 LLMs With Apache Spark 05:20
109 Summary 02:48
110 Introduction 06:06
111 What Is Apache Kafka? 07:00
112 Partitioning Data 08:56
113 Kafka API 07:42
114 Kafka Architecture 03:15
115 Set Up Kafka 05:53
116 Writing to Kafka 06:07
117 Reading from Kafka 07:37
118 Data Durability 06:39
119 Kafka vs Queues 02:11
120 [Exercise] Processing Records 03:44
121 [Exercise] Processing Records - Solution 02:59
122 Delivery Semantics 05:53
123 Kafka Transactions 04:34
124 Log Compaction 03:23
125 Kafka Connect 06:59
126 Using Kafka Connect 09:44
127 Outbox Pattern 04:31
128 Schema Registry 08:01
129 Using Schema Registry 08:10
130 Tiered Storage 03:28
131 [Exercise] Track Order Status Changes 04:27
132 [Exercise] Track Order Status Changes - Solution 05:06
133 Summary 04:41
134 Introduction 05:40
135 What Is Apache Flink? 05:24
136 Kafka Application 08:11
137 Multiple Streams 03:11
138 Installing Apache Flink 05:46
139 Processing Individual Records 07:22
140 [Exercise] Stream Processing 04:02
141 [Exercise] Stream Processing - Solution 02:40
142 Time Windows 06:49
143 Keyed Windows 02:40
144 Using Time Windows 05:18
145 Watermarks 10:06
146 Advanced Window Operations 06:17
147 Stateful Stream Processing 07:50
148 Using Local State 04:42
149 [Exercise] Anomaly Detection 04:35
150 [Exercise] Anomaly Detection - Solution 03:34
151 Joining Streams 05:50
152 Summary 03:10
153 Thank You! 01:18

Similar courses to The Data Engineering Bootcamp: Zero to Mastery

Statistics Bootcamp (with Python): Zero to Mastery (zerotomastery.io)

Category: Python, ChatGPT, Data processing and analysis
Duration: 20 hours 50 minutes 51 seconds

dbt for Data Engineers (Andreas Kretz)

Category: Data processing and analysis
Duration: 1 hour 52 minutes 55 seconds

Dimensional Data Modeling (Eka Ponkratova)

Category: Data processing and analysis
Duration: 1 hour 37 minutes 57 seconds

Data Structures and Algorithmic Trading: Machine Learning (udemy)

Category: Data processing and analysis
Duration: 2 hours 20 minutes 32 seconds

Learning Apache Spark (Andreas Kretz)

Category: Data processing and analysis
Duration: 1 hour 44 minutes 4 seconds

Azure Data Pipelines with Terraform (Andreas Kretz)

Category: Azure, Terraform, Data processing and analysis
Duration: 4 hours 20 minutes 29 seconds

Complete Machine Learning and Data Science: Zero to Mastery (udemy, zerotomastery.io)

Category: Data processing and analysis
Duration: 43 hours 22 minutes 23 seconds

Mathematical Foundations of Machine Learning (udemy)

Category: Python, Data processing and analysis
Duration: 16 hours 25 minutes 26 seconds

Case Study in Product Data Science (LunarTech)

Category: Data processing and analysis
Duration: 1 hour 4 minutes 47 seconds

Machine Learning Design Questions (algoexpert)

Category: Data processing and analysis
Duration: 3 hours 3 minutes 57 seconds