Learn proven methods for quickly improving AI applications. Build AI systems that outperform competitors, regardless of the specific use case.
If you encounter questions like these while working with AI:
How do you test applications whose results are probabilistic and require subjective evaluation?
If I change a prompt, how do I make sure nothing else breaks?
Where should engineering effort go? Do you really need to test everything?
Where do you start when you have no data or users yet?
Which metrics should you track? Which tools and models should you choose?
Can testing and evaluation be automated? And if so, how can you trust the results?
- then this course is for you.
This is a hands-on course for engineers and technical product managers. It is ideal for those who know how to program or simply "code by intuition."
What to Expect
Expect intensive hands-on practice: exercises and work with code and data. We meet twice a week for four weeks, plus generous office hours. All sessions are recorded and available asynchronously.
Course Content
Basics and lifecycle of LLM application evaluation
Systematic error analysis
Building effective metrics and automated evaluation pipelines
Collaborative practices and alignment of evaluation criteria
Testing strategies for different architectures (RAG, pipelines, multimodal systems, etc.)
Monitoring in production and continuous quality evaluation
Organizing an effective human-in-the-loop review process
Cost optimization and query routing
Learning Outcomes
Master the best tools for finding, diagnosing, and prioritizing errors in AI.
Learn to use synthetic data before you have users, and to get the most out of real data once you do.
Build a "data flywheel" that ensures your AI improves over time.
Learn to automate parts of the evaluation process and to trust the results.
Learn to align AI behavior with your preferences and requirements.
Avoid common mistakes, drawing on lessons from more than 35 AI projects.
Gain practical experience through end-to-end exercises, code, and analysis of real cases.
Hamel Husain is a machine learning engineer with over 20 years of experience. He has worked at innovative companies such as Airbnb and GitHub, where he led early LLM research on code understanding that was later used by OpenAI. He is the author of, and a contributor to, numerous popular open-source machine learning tools. Hamel currently works as an independent consultant, helping companies build AI products.
Shreya Shankar — Machine Learning Engineer and AI Researcher at UC Berkeley
Shreya Shankar is a machine learning engineer and PhD candidate in computer science at the University of California, Berkeley. Her work focuses on building reliable AI systems, with a particular emphasis on large language models (LLMs), data quality, and evaluation frameworks.
Research Focus: Reliable and Practical AI Systems
Shreya Shankar builds systems that help developers and organizations use AI effectively in data-centric workflows. Her research areas include:
LLM evaluation and alignment with human preferences
Data quality and data-centric AI systems
Frameworks for building reliable machine learning pipelines
Human-in-the-loop AI systems
Her work bridges the gap between theoretical research and real-world AI applications.
Influential Research and Publications
Shreya has published papers at top-tier conferences, including:
SIGMOD (data management)
VLDB (Very Large Data Bases)
UIST (User Interface Software and Technology)
One of her most notable works is:
"Who Validates the Validators?" — a paper on aligning LLM-based evaluation systems with human judgment, addressing a key challenge in modern AI development.