Webinar recap: Eval best practices
Thanks to everyone who joined our webinar, In the Loop: Technical Q&A. Bryan Cox, our VP of Sales, and Ankur Goyal, our founder and CEO, hosted a technical session focused on evals, agents, and LLM observability.
If you missed it or want to revisit the details, you can watch the full webinar recording below.
Q: How do I get started building effective evals?
Keep it simple. Start with about 10 examples and build a feedback loop. Don't worry about creating the perfect dataset from the start; use real user feedback to iterate quickly.
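To make this concrete, here is a minimal sketch of a first eval, assuming the Braintrust Python SDK (`braintrust`) and the open-source `autoevals` scorer library; the project name, task function, and examples are placeholders for your own data.

```python
from braintrust import Eval
from autoevals import Levenshtein


def answer(question: str) -> str:
    # Placeholder for your real model or application call.
    return "Paris" if "France" in question else "4"


Eval(
    "getting-started-evals",  # hypothetical project name
    # Start small: about ten examples drawn from real user interactions.
    data=lambda: [
        {"input": "What is the capital of France?", "expected": "Paris"},
        {"input": "What is 2 + 2?", "expected": "4"},
        # ...roughly eight more examples from real usage...
    ],
    task=answer,
    # One simple scorer is enough to close the feedback loop.
    scores=[Levenshtein],
)
```

Once this runs, each result lands in Braintrust, and you can iterate on the dataset and scorer as feedback comes in.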
Q: What are some common scoring functions teams start with?
- Levenshtein distance: a simple string-comparison metric.
- Factuality: a popular OpenAI prompt-based evaluation that checks factual correctness.
- Closed QA: scores outputs against defined criteria without needing an exact correct answer.
Try these simpler methods first and adjust them based on discrepancies you see between automated and manual scores.
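All three scorers are available in the open-source `autoevals` library. The sketch below shows how they might be called directly; the example strings and criteria are illustrative, and the exact keyword arguments (for example, `criteria` on Closed QA) should be checked against the library you have installed.

```python
from autoevals import ClosedQA, Factuality, Levenshtein

question = "Where is the Eiffel Tower?"
output = "The Eiffel Tower is in Paris, France."
expected = "The Eiffel Tower is located in Paris."

# Levenshtein: pure string distance, no LLM call.
print(Levenshtein()(output, expected).score)

# Factuality: LLM-graded check that the output agrees with the expected answer.
print(Factuality()(output, expected, input=question).score)

# Closed QA: LLM-graded check against criteria, no exact expected answer needed.
print(ClosedQA()(input=question, output=output, criteria="Names the correct city.").score)
```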
Q: What is Braintrust’s approach to multi-step prompt chaining (agents)?
We just added an agents feature in our playground UI, which makes chaining multiple prompts straightforward. You can already use loops, parallel branches, and external tool calls through our API, with SDK support available too. We're also working on an intuitive UI with visual annotations to make complex workflows easier.
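As a rough illustration of chaining steps in code (not the new playground agents feature itself), here is a sketch of a two-step prompt chain with per-step tracing, assuming the Braintrust Python SDK's `traced` decorator and the OpenAI client; the project name, model, and prompts are placeholders.

```python
from braintrust import init_logger, traced
from openai import OpenAI

logger = init_logger(project="agent-chain-demo")  # hypothetical project name
client = OpenAI()


@traced
def plan(question: str) -> str:
    # Step 1: ask the model to outline an approach.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Outline the steps to answer: {question}"}],
    )
    return resp.choices[0].message.content


@traced
def answer(question: str, outline: str) -> str:
    # Step 2: answer the question using the outline from step 1.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Using this outline:\n{outline}\n\nAnswer: {question}"}],
    )
    return resp.choices[0].message.content


question = "How do I get started with evals?"
print(answer(question, plan(question)))
```

Each decorated step shows up as its own span, so you can inspect intermediate outputs when debugging a chain.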
Q: How do customers integrate evals into continuous integration (CI)?
Some of our advanced customers integrate Braintrust’s GitHub action into their CI. It smartly caches results, so only evals affected by recent code changes get rerun, which saves a ton of time and cost.
Q: How does Braintrust handle user feedback and PII in evals?
We're working on anonymization features to strip out personally identifiable information (PII), letting you safely incorporate real user feedback into your evals.
Q: Can Braintrust evaluate multimodal data (images, audio, video)?
We recently rolled out support for multimodal attachments in the playground, so you can directly upload and evaluate those datasets.
Q: How should teams balance automated scoring with human review?
Automated scoring helps flag interesting or tricky cases. Humans should focus on reviewing those flagged results. Plus, the playground lets non-technical users and subject matter experts directly refine prompts and scores, greatly improving your evaluation quality.
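One simple pattern for this triage, sketched here with illustrative field names and an assumed threshold, is to let automated scores filter results and send only the low-scoring subset to human reviewers.

```python
# Hypothetical scored results; in practice these would come from your eval runs.
results = [
    {"input": "Q1", "output": "A1", "factuality": 0.95},
    {"input": "Q2", "output": "A2", "factuality": 0.40},
    {"input": "Q3", "output": "A3", "factuality": 0.62},
]

REVIEW_THRESHOLD = 0.7  # illustrative cutoff

# Automated scores do the triage; humans only review the flagged subset.
needs_review = [r for r in results if r["factuality"] < REVIEW_THRESHOLD]
for r in needs_review:
    print(f"Flag for human review: {r['input']!r} -> {r['output']!r} (score {r['factuality']})")
```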
Q: What is Brainstore and why was it developed?
Brainstore is our logging database built specifically for large-scale LLM workloads. It solves issues like handling massive data volumes, large JSON logs, and rapid data growth. It scales easily on object storage, offers instant search, and drastically improves log management and observability.
Q: Have customers successfully used LLMs to automate scoring guidance?
Yes, but usually only as a first pass. Teams typically use LLMs to draft initial scoring criteria and then refine those criteria manually. This approach significantly speeds up the scoring process.
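A sketch of that first pass might look like the following, assuming the OpenAI client; the model name, prompt, and examples are illustrative, and the drafted rubric is meant to be edited by hand before it is wired into a scorer.

```python
from openai import OpenAI

client = OpenAI()

# A few representative examples from real usage (illustrative).
examples = [
    {"input": "Summarize our refund policy.", "output": "Refunds are issued within 30 days."},
    {"input": "What plans do you offer?", "output": "We offer Free, Pro, and Enterprise plans."},
]

prompt = (
    "Draft 3-5 concrete scoring criteria (each a yes/no question) for judging "
    "answers like these:\n"
    + "\n".join(f"- Q: {e['input']} A: {e['output']}" for e in examples)
)

draft_criteria = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# Review and edit the draft manually before using it in an eval.
print(draft_criteria)
```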
Q: What’s your advice on using synthetic data in evals?
Synthetic data can be useful, especially when real data is scarce or unavailable due to privacy or regulations. But you should always use synthetic data to complement, not replace, real user data.
The future of evals
In the coming years, evals will automate more tasks, better aligning AI outputs with human expectations. They'll become more sophisticated and involve more team members beyond just engineers, shifting from manual efforts toward continuous automated evaluation and improvement.
Stay connected with Braintrust for future events and insights: