Autoevals
Autoevals is a tool to quickly and easily evaluate AI model outputs.
It bundles together a variety of automatic evaluation methods including:
- LLM-as-a-Judge
- Heuristic (e.g. Levenshtein distance)
- Statistical (e.g. BLEU)
Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.
You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs, and manage exceptions. Autoevals is an open source library available on GitHub.
Installation
Autoevals is distributed as a Python library on PyPI and as a Node.js library on npm. Install it with pip install autoevals or npm install autoevals.
Example
Use Autoevals to model-grade an example LLM completion using the factuality prompt.
By default, Autoevals uses your OPENAI_API_KEY environment variable to authenticate with OpenAI's API.
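A minimal Python sketch of this example follows. It assumes the Factuality evaluator is importable from autoevals.llm and is called as evaluator(output, expected, input=...); grade_factuality is a hypothetical helper name introduced only for illustration. The import sits inside the function so the snippet loads even where the package is not installed.

```python
def grade_factuality(question: str, output: str, expected: str):
    """Score `output` against `expected` using the model-graded Factuality evaluator."""
    # Assumption: Factuality lives in autoevals.llm. Imported lazily so this
    # module can be loaded without the autoevals package present.
    from autoevals.llm import Factuality

    evaluator = Factuality()
    # This call contacts the grading model, so it needs OPENAI_API_KEY
    # in the environment and network access.
    return evaluator(output, expected, input=question)
```

Calling grade_factuality("Which country has the highest population?", "People's Republic of China", "China") should return a result whose score attribute lies between 0 and 1, with the grader's raw details under metadata.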
Using Braintrust with Autoevals
Once you grade an output using Autoevals, it's convenient to use Braintrust to log and compare your evaluation results.
Create a file named example.eval.js (it must take the form *.eval.[ts|tsx|js|jsx]):
Then, run the file with the Braintrust CLI.
Supported Evaluation Methods
LLM-as-a-Judge
- Battle:
Test whether an output better performs the instructions than the original (expected) value.
- ClosedQA:
Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.
- Humor:
Test whether an output is funny.
- Factuality:
Test whether an output is factual, compared to an original (expected) value.
- Moderation:
A scorer that uses OpenAI's moderation API to determine whether an AI response contains any flagged content.
- Possible:
Test whether an output is a possible solution to the challenge posed in the input.
- Security:
Test whether an output is malicious.
- Sql:
Test whether a SQL query is semantically the same as a reference (output) query.
- Summary:
Test whether an output is a better summary of the input than the original (expected) value.
- Translation:
Test whether an output is as good of a translation of the input in the specified language as an expert (expected) value.
RAG
- ContextEntityRecall:
Estimates context recall by estimating true positives (TP) and false negatives (FN) using the annotated answer and the retrieved context.
- ContextRelevancy:
Extracts relevant sentences from the provided context that are absolutely required to answer the given question.
- ContextRecall:
Analyzes each sentence in the answer and classifies if the sentence can be attributed to the given context or not.
- ContextPrecision:
Verifies if the context was useful in arriving at the given answer.
- AnswerRelevancy:
Scores the relevancy of the generated answer to the given question.
- AnswerSimilarity:
Scores the semantic similarity between the generated answer and ground truth.
- AnswerCorrectness:
Measures answer correctness compared to ground truth using a weighted average of factuality and semantic similarity.
Composite
- ListContains:
Semantically evaluates the overlap between two lists of strings using pairwise similarity and Linear Sum Assignment.
- ValidJSON:
Evaluates the validity of JSON output, optionally validating against a JSON Schema definition.
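To make the ValidJSON idea concrete, here is a small standard-library sketch. It is not the library's implementation: valid_json_score is a hypothetical name, and it only checks that the text parses as JSON. Validating against a JSON Schema would need a separate validator and is left out here.

```python
import json


def valid_json_score(output: str) -> int:
    """Return 1 if `output` parses as JSON, otherwise 0."""
    try:
        json.loads(output)
    except (TypeError, ValueError):
        # json.JSONDecodeError subclasses ValueError; TypeError covers
        # non-string inputs such as None.
        return 0
    return 1
```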
Embeddings
- EmbeddingSimilarity:
Evaluates the semantic similarity between two embeddings using cosine distance.
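The comparison step can be sketched with plain Python lists standing in for embedding vectors. This is an illustration, not the library's code: cosine_similarity and similarity_score are hypothetical names, and clamping negative similarities to 0 is an assumption made here to keep the score in the 0 to 1 range.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Clamp cosine similarity into [0, 1] (assumed normalization)."""
    return max(0.0, min(1.0, cosine_similarity(a, b)))
```

In practice the two vectors would come from an embedding model; only the comparison is shown here.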
Heuristic
- JSONDiff:
Compares JSON objects using customizable comparison methods for strings and numbers.
- Levenshtein:
Uses the Levenshtein distance to compare two strings.
- ExactMatch:
Compares two values for exact equality. If the values are objects, they are converted to JSON strings before comparison.
- NumericDiff:
Compares numbers by normalizing their difference.
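As an illustration of how a heuristic like Levenshtein can be turned into a score, the sketch below computes the edit distance and divides by the longer string's length. The function names are hypothetical, and this normalization is one reasonable choice, not necessarily the one the library uses.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_score(output: str, expected: str) -> float:
    """Map edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(output, expected) / longest
```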
Custom evaluation prompts
Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:
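A hedged Python sketch of such a custom evaluator is shown below. It assumes an LLMClassifier class in autoevals.llm that accepts name, prompt_template, choice_scores, and use_cot, and that templates use mustache-style {{...}} placeholders. The prompt text and the build_title_evaluator helper are made up for illustration.

```python
# Prompt: the placeholders are filled in per example by the evaluator.
PROMPT_TEMPLATE = """
You are comparing two candidate titles for the same GitHub issue.

Issue: {{input}}

1: {{output}}
2: {{expected}}

Answer "1" if the first title describes the issue better, otherwise "2".
""".strip()

# Scoring mechanism: map each allowed choice to a score in [0, 1].
CHOICE_SCORES = {"1": 1, "2": 0}


def build_title_evaluator():
    """Construct the model-graded evaluator (needs autoevals installed)."""
    # Assumption: LLMClassifier takes these keyword arguments.
    from autoevals.llm import LLMClassifier

    return LLMClassifier(
        name="TitleQuality",
        prompt_template=PROMPT_TEMPLATE,
        choice_scores=CHOICE_SCORES,
        use_cot=True,  # ask the grader to reason before choosing
    )
```

The resulting evaluator would then be called like the built-in ones, for example evaluator(output, expected, input=issue_text).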
Creating custom scorers
You can also create your own scoring functions that do not use LLMs. For example, to test whether the word 'banana' is in the output, you can use the following:
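A self-contained sketch of the banana check is given below. The Score dataclass is a local stand-in defined here so the snippet runs on its own; with the library installed you would presumably return its own score type instead.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Score:
    """Local stand-in for a scorer result (hypothetical, for illustration)."""
    name: str
    score: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def banana_scorer(
    output: str,
    expected: Optional[str] = None,
    input: Optional[str] = None,  # named to mirror the input/output/expected convention
) -> Score:
    """Score 1 if the word 'banana' appears in the output, otherwise 0."""
    return Score(name="banana_scorer", score=1 if "banana" in output else 0)
```

The expected and input arguments are unused here; they are kept so the function has the same shape as the other scorers.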
Why does this library exist?
There is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult when evaluating in practice:
- Normalizing metrics between 0 and 1 is tough. For example, check out the calculation in number.py to see how it's done for numeric differences.
- Parsing the outputs on model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to debug one output at a time, propagate errors, and tweak the prompts. Autoevals makes these tasks easy.
- Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. Prior to Autoevals, we couldn't find an open source library where you can simply pass input, output, and expected values through a bunch of different evaluation methods.
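To illustrate the normalization point from the list above, here is one way to squeeze a numeric difference into the 0 to 1 range. It is a sketch under assumptions: numeric_diff_score is a hypothetical name, and the formula is a common symmetric normalization that may differ from what number.py actually does.

```python
def numeric_diff_score(output: float, expected: float) -> float:
    """Return 1.0 for equal numbers, falling toward 0.0 as they diverge."""
    if output == 0 and expected == 0:
        return 1.0  # avoid dividing by zero when both values are zero
    # |expected - output| never exceeds |expected| + |output|,
    # so the ratio stays in [0, 1].
    return 1.0 - abs(expected - output) / (abs(expected) + abs(output))
```

Dividing by the sum of magnitudes makes the score scale-free: being off by 2 matters more when comparing 1 with 3 than when comparing 1001 with 1003.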