Unlock the potential of modern data platforms with Apache Iceberg, which masterfully combines the flexibility of data lakes with the reliability of data warehouses. In this course, you will delve into the workings of this powerful open table format, explore its architecture, and harness its key features such as schema evolution, "time travel," and high-performance analytics within Lakehouse systems.
Course Overview
Based on hands-on examples from real-world data engineering, this course guides you through setting up a local lab with Docker, Spark, and MinIO, along with creating and managing Iceberg tables. You'll acquire the skills needed for tasks ranging from writing data and analyzing metadata to query optimization and partition restructuring, preparing you to apply Iceberg confidently in production environments.
Learning Outcomes
By the end of this course, you will not only grasp the internal structure of Iceberg but also have a working environment with ready-made notebooks for your own projects, along with a deep understanding of the table operations central to Lakehouse architecture.
Why Choose Iceberg?
Iceberg solves persistent big data challenges, including slow queries, complex schema changes, and storage tightly coupled to compute. Discover why industry giants such as Netflix, Stripe, and Apple have adopted Iceberg, and learn to apply the same approaches in your own systems.
Course Activities
What you will do:
- Establish a local Lakehouse lab using Iceberg with Docker Compose, Spark, REST catalog, and MinIO.
- Get hands-on experience creating an Iceberg table with an engaging dataset: define a schema, write data using PySpark, and explore Iceberg's metadata management.
- Gain expertise in schema evolution, including adding, renaming, and modifying column types, as well as employing advanced partitioning techniques.
- Perform row-level operations, such as deletes, and use the "time travel" feature to query past versions of the data.
- Explore Iceberg's architecture, covering Parquet data files, manifests, snapshots, and catalogs.
- Use the MinIO UI to inspect how data and metadata are physically stored.
- Execute analytical SQL queries on Iceberg tables with PySpark, using common operations such as joins, group-bys, and filters.
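The architecture you'll explore above (snapshots pointing to manifests, which point to Parquet data files) can be sketched as a tiny self-contained model. This is plain Python for illustration only, with no Spark required; all class and field names here are simplified teaching stand-ins, not Iceberg's actual API:

```python
from dataclasses import dataclass, field

# Simplified model of Iceberg's metadata tree: table metadata tracks
# snapshots; each snapshot references manifest files, which list the
# actual Parquet data files. Real Iceberg also stores per-file stats,
# partition specs, and a manifest list per snapshot.

@dataclass
class Manifest:
    data_files: list  # paths to Parquet files tracked by this manifest

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    manifests: list

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)

    def current(self):
        # The latest snapshot is the table's current state.
        return self.snapshots[-1]

    def as_of(self, timestamp_ms):
        # "Time travel": pick the latest snapshot at or before a timestamp.
        eligible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return max(eligible, key=lambda s: s.timestamp_ms)

def files(snapshot):
    # Resolve a snapshot to the data files a query would actually scan.
    return [f for m in snapshot.manifests for f in m.data_files]

# Two commits: the second snapshot adds a new data file.
v1 = Snapshot(1, 1000, [Manifest(["data/a.parquet"])])
v2 = Snapshot(2, 2000, [Manifest(["data/a.parquet", "data/b.parquet"])])
table = TableMetadata(snapshots=[v1, v2])

print(files(table.current()))    # current state: both files
print(files(table.as_of(1500)))  # time travel: only the first commit's file
```

Because every write produces a new immutable snapshot, reading an older version is just resolving a different snapshot's file list, which is the mechanism behind the time-travel queries you'll run in the course.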