Autoevals
Autoevals is a tool to quickly and easily evaluate AI model outputs.
It bundles together a variety of automatic evaluation methods including:
- LLM-as-a-judge
- Heuristic (e.g. Levenshtein distance)
- Statistical (e.g. BLEU)
Autoevals is developed by the team at Braintrust.
Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.
You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs, and manage exceptions.
Requirements
- Python 3.9 or higher
- Compatible with both OpenAI Python SDK v0.x and v1.x
Installation
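Autoevals is available on PyPI:

```bash
pip install autoevals
```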
Getting started
Use Autoevals to model-grade an example LLM completion using the Factuality prompt.
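The example below is a minimal sketch; the question and answers are illustrative.

```python
from autoevals.llm import Factuality

# Create an LLM-based evaluator
evaluator = Factuality()

# Grade an example completion against the expected answer
result = evaluator(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)

print(f"Factuality score: {result.score}")        # a value between 0 and 1
print(f"Factuality metadata: {result.metadata}")  # the grader's raw output and rationale
```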
By default, Autoevals uses your `OPENAI_API_KEY` environment variable to authenticate with OpenAI's API.
Using other AI providers
When you use Autoevals, it looks for an `OPENAI_BASE_URL` environment variable to use as the base URL for requests to an OpenAI-compatible API. If `OPENAI_BASE_URL` is not set, it defaults to the Braintrust AI proxy.
If you choose to use the proxy, you'll also get:
- Simplified access to many AI providers
- Reduced costs with automatic request caching
- Increased observability when you enable logging to Braintrust
The proxy is free to use, even if you don't have a Braintrust account.
If you have a Braintrust account, you can optionally set the `BRAINTRUST_API_KEY` environment variable instead of `OPENAI_API_KEY` to unlock additional features like logging and monitoring. You can also route requests to supported AI providers and models, or to custom models you have configured in Braintrust.
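For example, a typical setup using the proxy might look like the following (the proxy URL is an assumption; check the Braintrust docs for the current endpoint):

```bash
# Route OpenAI-compatible requests through the Braintrust AI proxy (URL is an assumption)
export OPENAI_BASE_URL="https://api.braintrust.dev/v1/proxy"

# Either key works; BRAINTRUST_API_KEY additionally unlocks logging and monitoring
export BRAINTRUST_API_KEY="<your Braintrust API key>"
```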
Custom client configuration
There are two ways to configure a custom client when you need to use a different OpenAI-compatible API:
- Global configuration: Initialize a client that will be used by all evaluators
- Instance configuration: Configure a client for a specific evaluator
Global configuration
Set up a client that all your evaluators will use:
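```python
# A sketch: assumes autoevals exposes init() for registering a global client,
# as in recent releases.
import openai
from autoevals import init

# Any OpenAI-compatible endpoint works; the URL and key below are placeholders
client = openai.OpenAI(
    base_url="https://api.example.com/v1",
    api_key="your-api-key",
)

# Register the client so every evaluator uses it
init(client=client)
```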
Instance configuration
Configure a client for a specific evaluator instance:
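```python
# A sketch: assumes evaluators accept a client keyword argument, as in recent releases.
import openai
from autoevals.llm import Factuality

# This client is used only by the evaluator it is passed to
custom_client = openai.OpenAI(
    base_url="https://api.example.com/v1",
    api_key="your-api-key",
)

evaluator = Factuality(client=custom_client)
```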
Using Braintrust with Autoevals (optional)
Once you grade an output using Autoevals, you can optionally use Braintrust to log and compare your evaluation results. The integration is entirely optional and is not required to use Autoevals.
Create a file named `example.eval.js` (it must take the form `*.eval.[ts|tsx|js|jsx]`):
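```javascript
// A minimal sketch; assumes the `autoevals` and `braintrust` npm packages are installed
import { Factuality } from "autoevals";
import { Eval } from "braintrust";

Eval("Autoevals", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
    },
  ],
  task: () => "People's Republic of China",
  scores: [Factuality],
});
```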
Then, run:
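```bash
# Assumes the Braintrust CLI; the exact subcommand may differ (see the Braintrust docs)
npx braintrust eval example.eval.js
```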
Supported evaluation methods
LLM-as-a-judge evaluations
- Battle
- Closed QA
- Humor
- Factuality
- Moderation
- Security
- Summarization
- SQL
- Translation
- Fine-tuned binary classifiers
RAG evaluations
- Context precision
- Context relevancy
- Context recall
- Context entity recall
- Faithfulness
- Answer relevancy
- Answer similarity
- Answer correctness
Composite evaluations
- Semantic list contains
- JSON validity
Embedding evaluations
- Embedding similarity
Heuristic evaluations
- Levenshtein distance
- Exact match
- Numeric difference
- JSON diff
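All of these scorers share the same calling convention, so swapping one for another is a one-line change. For instance, a heuristic scorer such as Levenshtein runs without any LLM calls (a sketch, assuming `Levenshtein` is exported from the top-level package):

```python
from autoevals import Levenshtein

result = Levenshtein()(output="hello world", expected="hello werld")
print(result.score)  # normalized to a value between 0 and 1
```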
Custom evaluation prompts
Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:
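```python
# A sketch: assumes the LLMClassifier helper, which builds a scorer from a prompt
# template and a mapping from choices to scores. All names and text are illustrative.
from autoevals import LLMClassifier

# {{input}}, {{output}}, and {{expected}} are filled in at evaluation time
prompt_template = """
You are comparing two titles for the GitHub issue described below.

Issue description: {{input}}

1: {{output}}
2: {{expected}}

Which title better describes the issue? Answer with 1 or 2.
"""

# Map each possible answer to a score between 0 and 1
choice_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    name="TitleQuality",
    prompt_template=prompt_template,
    choice_scores=choice_scores,
    use_cot=True,  # ask the model to reason before answering
)

result = evaluator(
    output="Empty payloads cause 500 errors",
    expected="Fix server error on empty payload",
    input="The endpoint returns a 500 error whenever the request payload is empty.",
)
print(result.score)
```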
Creating custom scorers
You can also create your own scoring functions that do not use LLMs. For example, to test whether the word `banana` is in the output, you can use the following:
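```python
from autoevals import Score  # assumes Score is exported at the top level

def banana_scorer(output, expected, input):
    # Any function that returns a Score (a name plus a value between 0 and 1) works as a scorer
    return Score(name="banana_scorer", score=1 if "banana" in output else 0)

result = banana_scorer(
    output="3",
    expected="3 bananas",
    input="What is 1 banana + 2 bananas?",
)
print(result.score)  # 0, since "banana" does not appear in the output
```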
Why does this library exist?
There is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult when evaluating in practice:
- Normalizing metrics between 0 and 1 is tough. For example, check out the calculation in number.py to see how it's done for numeric differences.
- Parsing the outputs on model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to debug one output at a time, propagate errors, and tweak the prompts. Autoevals makes these tasks easy.
- Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. Prior to Autoevals, we couldn't find an open source library where you could simply pass `input`, `output`, and `expected` values through a bunch of different evaluation methods.
Documentation
The full docs are available for your reference.
Contributing
We welcome contributions!
To install the development dependencies, run `make develop`, then run `source env.sh` to activate the environment. Create a `.env` file from the `.env.example` template, set the environment variables, and run `direnv allow` to load them.
To run the tests, run `pytest` from the root directory.
Send a PR and we'll review it! We'll take care of versioning and releasing.