Learn proven methods for quickly improving AI applications. Build AI systems that outperform competitors, regardless of the specific use case.
If you encounter questions like these while working with AI:
How do you test applications whose outputs are probabilistic and require subjective evaluation?
If I change a prompt, how can I be sure nothing else breaks?
Where should engineering effort be focused? Do you really need to test everything?
What do you do when you have no data and no users yet? Where do you start?
Which metrics should you track? Which tools should you use? Which models should you choose?
Can testing and evaluation be automated, and if so, how can you trust the results?
...then this course is for you.
This is a practical course for engineers and technical product managers. It is ideal for those who can program or are comfortable with "vibe coding."
What to Expect
Expect intensive, hands-on practice: exercises and work with code and data. We meet twice a week for four weeks, plus generous office hours. All sessions are recorded and available for asynchronous viewing.
Course Content
Basics and lifecycle of LLM application evaluation
Systematic error analysis
Building effective metrics and automated evaluation pipelines
Collaborative practices and alignment of evaluation criteria
Testing strategies for different architectures (RAG, pipelines, multimodal systems, etc.)
Monitoring in production and continuous quality evaluation
Organizing an effective human-in-the-loop review process
Cost optimization and query routing
Learning Outcomes
Master the best tools for finding, diagnosing, and prioritizing errors in AI.
Learn how to use synthetic data before you have users, and how to get the most value out of real data once you do.
Build a "data flywheel" that ensures your AI improves over time.
Learn to automate parts of the evaluation process and gain confidence that the automation can be trusted.
Be able to customize AI to your preferences and requirements.
Avoid common mistakes, drawing on lessons from more than 35 AI projects.
Gain practical experience through end-to-end exercises, code, and analysis of real cases.
Hamel Husain is a machine learning engineer with over 20 years of experience. He has worked at innovative companies such as Airbnb and GitHub, where he worked on early LLM research for code understanding that was later used by OpenAI. He is the author of, and a contributor to, numerous popular open-source machine learning tools. He currently works as an independent consultant, helping companies build AI products.
Shreya Shankar is a machine learning engineer and a PhD candidate in computer science at UC Berkeley, where she develops systems to help people effectively use AI for data work. Her research focuses on creating practical tools and frameworks for building reliable ML systems, including recent groundbreaking work on LLM evaluation and data quality. She has published influential papers on evaluating and aligning LLM systems, including “Who Validates the Validators?”, which examines how to systematically align LLM assessments with human preferences.
Before starting her PhD, Shreya worked as an ML engineer in the industry and received her bachelor's and master's degrees in computer science from Stanford. Her work has been published at leading conferences on data management and HCI, including SIGMOD, VLDB, and UIST. She is currently an NDSEG Fellowship recipient and actively collaborates with major tech companies and startups, implementing her research in real-world production environments. Her recent projects, such as DocETL and SPADE, demonstrate her ability to bridge theoretical foundations with practical solutions that help developers create more reliable AI systems.