Learn proven methods for quickly improving AI applications. Build AI systems that outperform competitors, regardless of the specific use case.
If you encounter questions like these while working with AI:
How do you test applications whose results are probabilistic and require subjective evaluation?
If you change a prompt, how do you make sure nothing else breaks?
Where should you focus engineering effort? Do you need to test everything?
Where do you start when you have no data and no users yet?
Which metrics should you track? Which tools should you use? Which models should you choose?
Can testing and evaluation be automated? And if so, how can you trust the results?
- then this course is for you.
This is a practical course for engineers and technical product managers. It is ideal for anyone who knows how to program, or who "enjoys coding by intuition."
What to Expect
Expect intensive, hands-on practice: exercises plus work with real code and data. We meet twice a week for four weeks, and we offer generous office hours. All sessions are recorded and available to watch asynchronously.
Course Content
Basics and lifecycle of LLM application evaluation
Systematic error analysis
Building effective metrics and automated evaluation pipelines (see the sketch after this list)
Collaborative practices and alignment of evaluation criteria
Testing strategies for different architectures (RAG, pipelines, multimodal systems, etc.)
Monitoring in production and continuous quality evaluation
Organizing an effective human-in-the-loop review process
Cost optimization and query routing
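To give a flavor of what "automated evaluation pipelines" can look like in practice, here is a minimal illustrative sketch (not course material; the names generate_answer, EvalCase, and run_evals are hypothetical stand-ins): a fixed set of test cases is run through a placeholder model function, and each output is scored with a simple programmatic check.

```python
# Minimal sketch of an automated evaluation loop (all names are hypothetical).
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # a simple, programmatically checkable expectation

def generate_answer(prompt: str) -> str:
    """Hypothetical stand-in: replace with your real LLM call."""
    canned = {
        "What is 6 x 7?": "6 x 7 = 42.",
        "Name the capital of France.": "The capital of France is Paris.",
    }
    return canned.get(prompt, "I don't know.")

def run_evals(cases: list[EvalCase]) -> float:
    """Run every case, print failures for error analysis, return the pass rate."""
    passed = 0
    for case in cases:
        output = generate_answer(case.prompt)
        if case.must_contain.lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case.prompt!r} -> {output!r}")
    return passed / len(cases)

if __name__ == "__main__":
    cases = [
        EvalCase("What is 6 x 7?", must_contain="42"),
        EvalCase("Name the capital of France.", must_contain="Paris"),
    ]
    print(f"Pass rate: {run_evals(cases):.0%}")
```

In a real pipeline, the string check would give way to richer scorers such as rubric-based grading or LLM-as-judge, and the logged failures would feed the systematic error analysis the course covers.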
Learning Outcomes
Master the best tools for finding, diagnosing, and prioritizing errors in AI.
Learn how to use synthetic data before you have users, and how to get the most out of real data once you do.
Build a "data flywheel" that ensures your AI improves over time.
Learn to automate parts of the evaluation process, and to trust that automation.
Be able to tailor AI behavior to your preferences and requirements.
Avoid common mistakes, drawing on lessons from more than 35 AI projects.
Gain practical experience through end-to-end exercises, code, and analysis of real cases.
Hamel Husain — Machine Learning Engineer and Consultant
Hamel Husain is a machine learning engineer with over 20 years of experience. He has worked at companies such as Airbnb and GitHub; at GitHub, his early LLM research on code understanding later supported work at OpenAI.
Hamel has created and contributed to many open-source machine learning tools. He now works as an independent consultant, helping companies build AI products.
Shreya Shankar — Machine Learning Engineer and AI Researcher
Shreya Shankar is a machine learning engineer and a PhD candidate in computer science at the University of California, Berkeley. She studies how to build reliable AI systems with a focus on large language models (LLMs), data quality, and clear evaluation methods.
Research Focus
Shreya designs systems that help people apply AI to data-centric work, and studies how to make AI tools stable, understandable, and safe to use. Her main research areas include:
LLM evaluation and alignment with human preferences
Data quality in data‑centric AI systems
Tools for building reliable machine learning pipelines
Human‑in‑the‑loop AI systems
Her work connects theory with real-world AI use.
Research and Publications
Shreya has published work at top computer science venues, including:
SIGMOD
VLDB
UIST
One well‑known paper is:
“Who Validates the Validators?” — a study on how to align LLM evaluation systems with human judgment.