The Data Engineering Bootcamp: Zero to Mastery
Learn how to build streaming pipelines with Apache Kafka and Flink, create data lakes on AWS, run ML workflows on Spark, and integrate LLMs into production systems. This course is designed to kickstart your career and make you a sought-after data engineer.
Why is Data Engineering becoming a major profession in IT?
Data Engineering is rapidly becoming one of the fastest-growing and most in-demand professions in tech. With the rise of AI products, analytical systems, and real-time applications, companies are actively building out their data infrastructure, which drives demand for these specialists.
Just last year, more than 20,000 new data engineer positions were created, and the total number of open positions in North America approached 150,000, clearly demonstrating the explosive growth of the industry.
Moreover, the salaries are impressive:
- Entry level - $80,000 to $110,000 per year
- Mid and senior level - $190,000 to $200,000+ per year
Furthermore, data engineers play a strategic role: they build the foundation for machine learning systems, analytics, and AI, without which modern tech products are impossible. With the further growth of AI, the demand for data engineers will only increase, creating excellent opportunities for a long-term career and financial stability.
Why this particular bootcamp?
Our bootcamp is designed to be as comprehensive and practical as possible, without unnecessary theory or outdated tutorials. You will learn step by step and build real projects using the same tools that professionals use.
You will start with Apache Spark, processing real Airbnb data and mastering large-scale computation. Then you will create a modern data lake on AWS using S3, EMR, Glue, and Athena. You will learn pipeline orchestration with Apache Airflow, build streaming systems on Kafka and Flink, and even integrate machine learning and LLMs (large language models) directly into your pipelines.
As a result, you will learn to build end-to-end production-level systems - the exact skills employers are looking for.
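To give a concrete taste, here is a minimal PySpark sketch in the spirit of the first module. The file path and column names are illustrative placeholders, not the course's actual dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; the course later runs similar jobs on EMR.
spark = SparkSession.builder.appName("airbnb_demo").getOrCreate()

# Load a CSV of Airbnb listings. The path and columns are placeholders.
listings = spark.read.csv("data/listings.csv", header=True, inferSchema=True)

# Average nightly price per neighbourhood, highest first.
(listings
    .groupBy("neighbourhood")
    .agg(F.avg("price").alias("avg_price"))
    .orderBy(F.desc("avg_price"))
    .show(10))
```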
What's inside the course?
- Introduction to Data Engineering: understand how modern data engineering works and what you need to get started.
- Big Data Processing with Apache Spark: work with large datasets using the DataFrame API, UDFs, aggregations, and job optimization.
- Building a Data Lake on AWS: build scalable data storage using S3, EMR, and Athena.
- Pipelines with Apache Airflow: automate and manage tasks, handle errors, and schedule and run Spark jobs.
- ML with Spark MLlib: embed machine learning into your pipelines, including classification, regression, and model selection.
- AI and LLMs in Data Engineering: use Hugging Face and other tools to integrate LLMs into data processing.
- Stream Processing with Apache Kafka and Flink: build real-time systems that process events and continuous streams.

The short, illustrative sketches below show roughly what several of these tools look like in practice; they are simplified examples, not the course's actual code.
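For the data lake module, Spark (assumed here to run on EMR, where S3 access is preconfigured) can write partitioned Parquet into S3, which Athena then queries through the Glue catalog. The bucket, database, and table names are hypothetical, and the listings DataFrame comes from the Spark sketch above:

```python
import boto3

# Write the listings as partitioned Parquet to S3 so Glue/Athena can
# prune partitions at query time. "my-data-lake" is a placeholder bucket.
(listings.write
    .mode("overwrite")
    .partitionBy("neighbourhood")
    .parquet("s3://my-data-lake/listings/"))

# Kick off an Athena query against the catalog table (names hypothetical).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT neighbourhood, avg(price) FROM listings GROUP BY 1",
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
```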
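For the Airflow module, a pipeline is defined as a DAG of tasks. This minimal sketch assumes Airflow 2.x, and the task logic is a placeholder:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data...")

def transform():
    print("cleaning data...")

# A minimal daily DAG; "schedule" is the Airflow 2.4+ spelling
# (older versions use schedule_interval).
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # run extract before transform
```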
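For the MLlib module, feature preparation and model training chain together into a single pipeline. This sketch reuses the hypothetical listings DataFrame from the Spark example; the feature columns are invented:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Split the (hypothetical) listings DataFrame into train and test sets.
train_df, test_df = listings.randomSplit([0.8, 0.2], seed=42)

# Assemble numeric columns into one feature vector, then fit a linear
# regression that predicts price. Column names are illustrative.
assembler = VectorAssembler(
    inputCols=["bedrooms", "bathrooms", "accommodates"],
    outputCol="features",
)
lr = LinearRegression(featuresCol="features", labelCol="price")

model = Pipeline(stages=[assembler, lr]).fit(train_df)
predictions = model.transform(test_df)  # adds a "prediction" column
```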
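For the LLM module, Hugging Face's transformers library wraps model download and inference behind a one-line pipeline. The review texts here are invented:

```python
from transformers import pipeline

# Score guest reviews with a sentiment-analysis pipeline; the default
# model is downloaded on first use.
sentiment = pipeline("sentiment-analysis")

reviews = ["Great place, spotless and quiet!", "The heating never worked."]
for review, result in zip(reviews, sentiment(reviews)):
    print(f'{result["label"]} ({result["score"]:.2f}): {review}')
```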
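For the streaming module, the Kafka side can be sketched with the kafka-python client (the course also covers Flink for the processing side). The broker address and topic name are placeholders:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Produce one JSON event (broker address and topic are placeholders).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 1, "status": "created"})
producer.flush()

# Read the topic back from the beginning.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # one record is enough for this demo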
Outcome
After completing the course, you won't just have watched some videos - you'll be a true data engineer, ready to build the systems companies need today.
Thousands of our graduates already work at Google, Tesla, Amazon, Apple, IBM, JP Morgan, Facebook, Shopify, and other top companies.
Many of them started from scratch. So why not become the next one?
Watch The Data Engineering Bootcamp: Zero to Mastery Online
# | Title | Duration |
---|---|---|
1 | The Data Engineering Bootcamp: Zero to Mastery | 01:35 |
2 | Introduction to Data Engineering | 04:17 |
3 | Who Are Data Engineers? | 04:43 |
4 | Prerequisites | 03:19 |
5 | Source Code for This Bootcamp | 01:19 |
6 | Plan for This Bootcamp | 04:38 |
7 | [Optional] What Is a Virtualenv? | 06:37 |
8 | [Optional] What Is Docker? | 11:03 |
9 | Introduction | 04:08 |
10 | Apache Spark | 03:44 |
11 | How Spark Works | 04:24 |
12 | Spark Application | 07:41 |
13 | DataFrames | 06:43 |
14 | Installing Spark | 05:51 |
15 | Inside Airbnb Data | 07:02 |
16 | Writing Your First Spark Job | 07:05 |
17 | Lazy Processing | 02:16 |
18 | [Exercise] Basic Functions | 01:29 |
19 | [Exercise] Basic Functions - Solution | 06:41 |
20 | Aggregating Data | 04:00 |
21 | Joining Data | 04:40 |
22 | Aggregations and Joins with Spark | 06:10 |
23 | Complex Data Types | 05:09 |
24 | [Exercise] Aggregate Functions | 00:50 |
25 | [Exercise] Aggregate Functions - Solution | 05:54 |
26 | User Defined Functions | 03:25 |
27 | Data Shuffle | 06:14 |
28 | Data Accumulators | 03:42 |
29 | Optimizing Spark Jobs | 07:39 |
30 | Submitting Spark Jobs | 04:29 |
31 | Other Spark APIs | 05:16 |
32 | Spark SQL | 04:33 |
33 | [Exercise] Advanced Spark | 02:10 |
34 | [Exercise] Advanced Spark - Solution | 05:26 |
35 | Summary | 03:08 |
36 | Introduction | 04:26 |
37 | What Is a Data Lake? | 09:08 |
38 | Amazon Web Services (AWS) | 07:47 |
39 | Simple Storage Service (S3) | 05:45 |
40 | Setting Up an AWS Account | 09:29 |
41 | Data Partitioning | 03:24 |
42 | Using S3 | 07:49 |
43 | EMR Serverless | 02:59 |
44 | IAM Roles | 02:52 |
45 | Running a Spark Job | 08:49 |
46 | Parquet Data Format | 07:41 |
47 | Implementing a Data Catalog | 05:32 |
48 | Data Catalog Demo | 06:42 |
49 | Querying a Data Lake | 04:00 |
50 | Summary | 03:39 |
51 | Introduction | 05:53 |
52 | What Is Apache Airflow? | 05:19 |
53 | Airflow’s Architecture | 03:15 |
54 | Installing Airflow | 06:33 |
55 | Defining an Airflow DAG | 08:03 |
56 | Error Handling | 03:38 |
57 | Idempotent Tasks | 04:54 |
58 | Creating a DAG - Part 1 | 04:58 |
59 | Creating a DAG - Part 2 | 04:42 |
60 | Handling Failed Tasks | 04:09 |
61 | [Exercise] Data Validation | 04:31 |
62 | [Exercise] Data Validation - Solution | 03:27 |
63 | Spark with Airflow | 03:02 |
64 | Using Spark with Airflow - Part 1 | 07:39 |
65 | Using Spark with Airflow - Part 2 | 05:52 |
66 | Sensors in Airflow | 04:46 |
67 | Using File Sensors | 04:08 |
68 | Data Ingestion | 05:50 |
69 | Reading Data from Postgres - Part 1 | 06:03 |
70 | Reading Data from Postgres - Part 2 | 05:40 |
71 | [Exercise] Average Customer Review | 03:53 |
72 | [Exercise] Average Customer Review - Solution | 04:33 |
73 | Advanced DAGs | 04:26 |
74 | Summary | 02:27 |
75 | Introduction | 05:28 |
76 | What Is Machine Learning? | 06:06 |
77 | Regression Algorithms | 05:38 |
78 | Building a Regression Model | 05:04 |
79 | Training a Model | 09:46 |
80 | Model Evaluation | 07:26 |
81 | Testing a Regression Model | 03:57 |
82 | Model Lifecycle | 02:12 |
83 | Feature Engineering | 08:44 |
84 | Improving a Regression Model | 07:34 |
85 | Machine Learning Pipelines | 03:56 |
86 | Creating a Pipeline | 02:41 |
87 | [Exercise] House Price Estimation | 01:59 |
88 | [Exercise] House Price Estimation - Solution | 03:12 |
89 | [Exercise] Imposter Syndrome | 02:57 |
90 | Classification | 07:37 |
91 | Classifiers Evaluation | 04:27 |
92 | Training a Classifier | 08:31 |
93 | Hyperparameters | 08:06 |
94 | Optimizing a Model | 03:02 |
95 | [Exercise] Loan Approval | 02:34 |
96 | [Exercise] Loan Approval - Solution | 02:33 |
97 | Deep Learning | 06:56 |
98 | Summary | 03:23 |
99 | Introduction | 05:07 |
100 | Natural Language Processing (NLP) before LLMs | 06:11 |
101 | Transformers | 06:21 |
102 | Types of LLMs | 07:40 |
103 | Hugging Face | 02:19 |
104 | Databricks Set Up | 10:38 |
105 | Using an LLM | 07:36 |
106 | Structured Output | 03:42 |
107 | Producing JSON Output | 05:10 |
108 | LLMs With Apache Spark | 05:20 |
109 | Summary | 02:48 |
110 | Introduction | 06:06 |
111 | What Is Apache Kafka? | 07:00 |
112 | Partitioning Data | 08:56 |
113 | Kafka API | 07:42 |
114 | Kafka Architecture | 03:15 |
115 | Set Up Kafka | 05:53 |
116 | Writing to Kafka | 06:07 |
117 | Reading from Kafka | 07:37 |
118 | Data Durability | 06:39 |
119 | Kafka vs Queues | 02:11 |
120 | [Exercise] Processing Records | 03:44 |
121 | [Exercise] Processing Records - Solution | 02:59 |
122 | Delivery Semantics | 05:53 |
123 | Kafka Transactions | 04:34 |
124 | Log Compaction | 03:23 |
125 | Kafka Connect | 06:59 |
126 | Using Kafka Connect | 09:44 |
127 | Outbox Pattern | 04:31 |
128 | Schema Registry | 08:01 |
129 | Using Schema Registry | 08:10 |
130 | Tiered Storage | 03:28 |
131 | [Exercise] Track Order Status Changes | 04:27 |
132 | [Exercise] Track Order Status Changes - Solution | 05:06 |
133 | Summary | 04:41 |
134 | Introduction | 05:40 |
135 | What Is Apache Flink? | 05:24 |
136 | Kafka Application | 08:11 |
137 | Multiple Streams | 03:11 |
138 | Installing Apache Flink | 05:46 |
139 | Processing Individual Records | 07:22 |
140 | [Exercise] Stream Processing | 04:02 |
141 | [Exercise] Stream Processing - Solution | 02:40 |
142 | Time Windows | 06:49 |
143 | Keyed Windows | 02:40 |
144 | Using Time Windows | 05:18 |
145 | Watermarks | 10:06 |
146 | Advanced Window Operations | 06:17 |
147 | Stateful Stream Processing | 07:50 |
148 | Using Local State | 04:42 |
149 | [Exercise] Anomaly Detection | 04:35 |
150 | [Exercise] Anomaly Detection - Solution | 03:34 |
151 | Joining Streams | 05:50 |
152 | Summary | 03:10 |
153 | Thank You! | 01:18 |
Similar courses to The Data Engineering Bootcamp: Zero to Mastery

- Business Intelligence with Excel (zerotomastery.io)
- PyTorch for Deep Learning with Python Bootcamp (Udemy)
- Data Platform & Pipeline Design (Andreas Kretz)
- Building APIs with FastAPI (Andreas Kretz)
- Complete linear algebra: theory and implementation (Udemy)
- Data Analysis for Beginners: Python & Statistics (zerotomastery.io)
- Data Analysis for Beginners: Excel & Pivot Tables (zerotomastery.io)
- Choosing Data Stores (Andreas Kretz)
- Becoming a Better Data Engineer (Andreas Kretz)
