Learn how the best teams, including Ramp, Notion, and OpenAI, ship quality AI products. In this course, you'll build a customer support chatbot from scratch and improve it with evals.
LLMs are probabilistic. The same prompt can return different answers on different runs, and one prompt tweak can improve one example but make others regress. Evals are how teams measure output quality and catch regressions before users do.
You'll learn to build evals the way top teams do to ship reliable products: writing deterministic and LLM-as-judge scorers, comparing prompt variants, and turning production traces into test cases.
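A deterministic scorer is just a function from a model output to a score, so it runs the same way every time. A minimal sketch (the function name, the order-ID rule, and the sample replies are illustrative, not taken from the course):

```python
import re

def echoes_order_id(response: str) -> float:
    """Deterministic scorer: 1.0 if the reply cites an order ID like #12345, else 0.0."""
    return 1.0 if re.search(r"#\d{4,}", response) else 0.0

# A tiny hypothetical suite: each reply paired with the score we expect.
cases = [
    ("Your order #88231 shipped yesterday.", 1.0),
    ("Your order shipped yesterday.", 0.0),
]

scores = [echoes_order_id(reply) for reply, _ in cases]
print(sum(scores) / len(scores))  # average score across the suite
```

Because the check is pure string matching, the same suite yields the same score on every run, which is what makes regressions visible when a prompt changes.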
14 modules across three sections (Learn, Build, Refine), taking about an hour. No prior evals experience needed.