autoevals
Autoevals is a comprehensive toolkit for evaluating AI model outputs.
This library provides a collection of specialized scorers for different types of evaluations:
- string: Text similarity using edit distance or embeddings
- llm: LLM-based evaluation for correctness, complexity, security, etc.
- moderation: Content safety and policy compliance checks
- ragas: Advanced NLP metrics for RAG system evaluation
- json: JSON validation and structural comparison
- number: Numeric similarity with relative scaling
- value: Exact matching and basic comparisons
Key features:
- Both sync and async evaluation support
- Configurable scoring parameters
- Detailed feedback through metadata
- Integration with OpenAI and other LLM providers through the Braintrust AI Proxy
Client setup:
There are two ways to configure the OpenAI client, both sketched below:
- Global initialization (recommended):
- Per-evaluator initialization:
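A minimal sketch of both approaches (assuming the OpenAI Python v1 client and the init(client=...) helper described below):
```python
from openai import OpenAI

from autoevals import init
from autoevals.llm import Factuality

# Global initialization: configure once, shared by every evaluator
init(client=OpenAI())

# Per-evaluator initialization: pass a client to a single scorer
evaluator = Factuality(client=OpenAI())
```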
Multi-provider support via the Braintrust AI Proxy:
Autoevals supports multiple LLM providers (Anthropic, Azure, etc.) through the Braintrust AI Proxy. Configure your client to use the proxy:
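For example (a sketch; substitute the API key for whichever provider you route through the proxy):
```python
import os

from openai import OpenAI

from autoevals import init

# Point the OpenAI client at the Braintrust AI Proxy
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["OPENAI_API_KEY"],  # or another provider's key
)
init(client=client)
```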
Braintrust integration:
Autoevals automatically integrates with Braintrust logging when you install the library. If needed, you can manually wrap the client:
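For example (a sketch assuming the braintrust SDK's wrap_openai helper):
```python
from braintrust import wrap_openai
from openai import OpenAI

from autoevals import init

# Wrap the client so evaluator LLM calls are logged to Braintrust
init(client=wrap_openai(OpenAI()))
```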
Example Autoevals usage:
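A minimal sketch of a scorer call (scorers are callables that return a Score with score and metadata; the values are illustrative):
```python
from autoevals.llm import Factuality

factuality = Factuality()
result = factuality(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    expected="The capital of France is Paris.",
)
print(result.score)     # 1 when the answer is judged factually consistent
print(result.metadata)  # rationale and choice details
```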
See individual module documentation for detailed usage and options.
autoevals.llm
LLM-based evaluation scorers for assessing model outputs.
This module provides a collection of pre-built LLM scorers for common evaluation tasks.
All evaluators accept the following common arguments:
- model: Model to use (defaults to gpt-4)
- temperature: Controls randomness (0-1, defaults to 0)
- client: OpenAI client (defaults to global client from init())
Example:
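For example (a sketch showing the common arguments; the model name and text are illustrative):
```python
from openai import OpenAI

from autoevals.llm import Humor

# Any LLM scorer accepts model, temperature, and client overrides
scorer = Humor(model="gpt-4o", temperature=0, client=OpenAI())
result = scorer(output="Why did the function cross the road? To get to the other scope.")
print(result.score, result.metadata)
```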
LLMClassifier
High-level classifier for evaluating text using LLMs.
This is the main class for building custom classifiers. It provides:
- Chain of thought reasoning for better accuracy
- Standardized output parsing
- Template-based prompts
- YAML configuration support
- Flexible scoring rules
Example:
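A sketch of a custom classifier (the name, prompt, and choices are illustrative):
```python
from autoevals.llm import LLMClassifier

# Map each allowed choice to a numeric score
classifier = LLMClassifier(
    name="tone",
    prompt_template="Is the following response polite or rude?\n\n{{output}}",
    choice_scores={"polite": 1, "rude": 0},
    use_cot=True,
)
result = classifier(output="Thanks for reaching out! Happy to help.")
print(result.score, result.metadata)
```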
Arguments:
- name: Classifier name for tracking
- prompt_template: Template for generating prompts (supports {{output}}, {{expected}}, etc.)
- choice_scores: Mapping of choices to scores (e.g. {"good": 1, "bad": 0})
- model: Model to use. Defaults to DEFAULT_MODEL.
- use_cot: Enable chain of thought reasoning. Defaults to True.
- max_tokens: Maximum tokens to generate. Defaults to 512.
- temperature: Controls randomness (0-1). Defaults to 0.
- engine: Deprecated by OpenAI. Use model instead.
- api_key: Deprecated. Use client instead.
- base_url: Deprecated. Use client instead.
- client: OpenAI client. If not provided, uses global client from init().
- **extra_render_args: Additional template variables
Battle
Compare if a solution performs better than a reference solution.
This evaluator uses LLM-based comparison to determine if a generated solution is better than a reference solution, considering factors like:
- Code quality and readability
- Algorithm efficiency and complexity
- Implementation completeness
- Best practices and patterns
- Error handling and edge cases
Example:
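For example (a sketch; the task and solutions are illustrative):
```python
from autoevals.llm import Battle

battle = Battle()
result = battle(
    instructions="Write a Python function that reverses a string.",
    output="def rev(s):\n    return s[::-1]",
    expected="def rev(s):\n    out = ''\n    for ch in s:\n        out = ch + out\n    return out",
)
print(result.score)     # 1 if the output is judged better than the reference
print(result.metadata)
```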
Arguments:
- instructions: Problem description or task requirements that both solutions should address
- output: Solution to evaluate (code, text, or other content)
- expected: Reference solution to compare against
Returns:
Score object with:
- score: 1 if output solution is better, 0 if worse
- metadata.rationale: Detailed explanation of the comparison
- metadata.choice: Selected choice (better/worse)
ClosedQA
Evaluate answer correctness using the model's knowledge.
Example:
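For example (a sketch; the values are illustrative):
```python
from autoevals.llm import ClosedQA

result = ClosedQA()(
    input="What is the boiling point of water at sea level?",
    output="100 degrees Celsius",
    criteria="The answer must state the correct temperature.",
)
print(result.score)
```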
Arguments:
- input: Question to evaluate
- output: Answer to assess
- criteria: Optional evaluation criteria
Humor
Rate the humor level in text.
Example:
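For example (a sketch; the text is illustrative):
```python
from autoevals.llm import Humor

result = Humor()(output="I told my computer a joke about UDP, but I'm not sure it got it.")
print(result.score)
```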
Arguments:
- output: Text to evaluate for humor
Factuality
Check factual accuracy against a reference.
Example:
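For example (a sketch; the values are illustrative):
```python
from autoevals.llm import Factuality

result = Factuality()(
    output="The Eiffel Tower was completed in 1889.",
    expected="The Eiffel Tower was finished in 1889.",
)
print(result.score)
```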
Arguments:
- output: Text to check
- expected: Reference text with correct facts
Possible
Evaluate if a solution is feasible and practical.
Example:
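For example (a sketch; the values are illustrative):
```python
from autoevals.llm import Possible

result = Possible()(
    input="Sort a list of one million integers on a laptop in under a second.",
    output="Use the built-in sort, which is O(n log n) and easily fast enough.",
)
print(result.score)
```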
Arguments:
- input: Problem description
- output: Proposed solution
Security
Evaluate if a solution has security vulnerabilities.
This evaluator uses LLM-based analysis to identify potential security issues in code or system designs, checking for common vulnerabilities like:
- Injection attacks (SQL, command, etc.)
- Authentication/authorization flaws
- Data exposure risks
- Input validation issues
- Unsafe dependencies
- Insecure configurations
- Common OWASP vulnerabilities
Example:
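For example (a sketch; the snippet under review is illustrative):
```python
from autoevals.llm import Security

result = Security()(
    instructions="Review this request handler for security issues.",
    output="query = f\"SELECT * FROM users WHERE name = '{username}'\"",
)
print(result.score)     # 0 if vulnerabilities are identified
print(result.metadata)
```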
Arguments:
- instructions: Context or requirements for the security evaluation
- output: Code or system design to evaluate for security issues
Returns:
Score object with:
- score: 1 if secure, 0 if vulnerable
- metadata.rationale: Detailed security analysis
- metadata.choice: Selected choice (secure/vulnerable)
- metadata.vulnerabilities: List of identified security issues
Sql
Compare if two SQL queries are equivalent.
Example:
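For example (a sketch; the queries are illustrative):
```python
from autoevals.llm import Sql

result = Sql()(
    output="SELECT name FROM users WHERE age >= 18",
    expected="SELECT name FROM users WHERE age > 17",
)
print(result.score)
```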
Arguments:
- output: SQL query to check
- expected: Reference SQL query
Summary
Evaluate text summarization quality.
Example:
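For example (a sketch; the texts are illustrative):
```python
from autoevals.llm import Summary

result = Summary()(
    input="Long article text about the history of the printing press...",
    output="The article traces the printing press from Gutenberg to modern offset printing.",
    expected="A short history of the printing press, from Gutenberg onward.",
)
print(result.score)
```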
Arguments:
- input: Original text
- output: Generated summary
- expected: Reference summary
Translation
Evaluate translation quality.
Example:
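For example (a sketch; the values are illustrative, and language is passed alongside the other arguments):
```python
from autoevals.llm import Translation

result = Translation()(
    input="Bonjour, comment allez-vous ?",
    output="Hello, how are you?",
    expected="Hello, how are you doing?",
    language="English",
)
print(result.score)
```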
Arguments:
- input: Source text
- output: Translation to evaluate
- expected: Reference translation
- language: Target language
autoevals.ragas
This module provides evaluators for assessing the quality of context retrieval and answer generation. These metrics are ported from the RAGAS project with some enhancements.
Context quality evaluators:
- ContextEntityRecall: Measures how well context contains expected entities
- ContextRelevancy: Evaluates relevance of context to question
- ContextRecall: Checks if context supports expected answer
- ContextPrecision: Measures precision of context relative to question
Answer quality evaluators:
- Faithfulness: Checks if answer claims are supported by context
- AnswerRelevancy: Measures answer relevance to question
- AnswerSimilarity: Compares semantic similarity to expected answer
- AnswerCorrectness: Evaluates factual correctness against ground truth
Common arguments:
- model: Model to use for evaluation, defaults to DEFAULT_RAGAS_MODEL (gpt-3.5-turbo-16k)
- client: Optional client for API calls. If not provided, uses global client from init()
Example:
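For example (a sketch showing the common arguments; the model name and values are illustrative):
```python
from openai import OpenAI

from autoevals.ragas import AnswerCorrectness

scorer = AnswerCorrectness(model="gpt-4o", client=OpenAI())  # both overrides are optional
result = scorer(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    expected="The capital of France is Paris.",
)
print(result.score)
```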
For more examples and detailed usage of each evaluator, see their individual class docstrings.
ContextEntityRecall
Measures how well the context contains the entities mentioned in the expected answer.
Example:
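For example (a sketch; the values are illustrative):
```python
from autoevals.ragas import ContextEntityRecall

result = ContextEntityRecall()(
    expected="Einstein was born in Ulm, Germany, in 1879.",
    context=["Albert Einstein was born in the city of Ulm in Germany in 1879."],
)
print(result.score)
```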
Arguments:
- expected: The expected/ground truth answer containing entities to find
- context: The context document(s) to search for entities in
ContextRelevancy
Evaluates how relevant the context is to the input question.
Example:
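For example (a sketch; the values are illustrative):
```python
from autoevals.ragas import ContextRelevancy

result = ContextRelevancy()(
    input="What year did Apollo 11 land on the Moon?",
    output="Apollo 11 landed on the Moon in 1969.",
    context=["Apollo 11 landed on the Moon in 1969.", "The mission used a Saturn V rocket."],
)
print(result.score)
```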
Arguments:
- input: The question being evaluated
- output: The generated answer
- context: The context document(s) to evaluate
ContextRecall
Measures how well the context supports the expected answer.
Example:
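For example (a sketch; the values are illustrative):
```python
from autoevals.ragas import ContextRecall

result = ContextRecall()(
    input="Who wrote Pride and Prejudice?",
    output="Jane Austen wrote it.",
    expected="Jane Austen",
    context=["Pride and Prejudice is an 1813 novel by Jane Austen."],
)
print(result.score)
```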
Arguments:
- input: The question being evaluated
- output: The generated answer
- expected: The expected/ground truth answer
- context: The context document(s) to evaluate
ContextPrecision
Measures how precise and focused the context is for answering the question.
Example:
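For example (a sketch; the values are illustrative):
```python
from autoevals.ragas import ContextPrecision

result = ContextPrecision()(
    input="What is the tallest mountain on Earth?",
    output="Mount Everest is the tallest mountain on Earth.",
    expected="Mount Everest",
    context=["Mount Everest is Earth's highest mountain above sea level."],
)
print(result.score)
```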
Arguments:
- input: The question being evaluated
- output: The generated answer
- expected: The expected/ground truth answer
- context: The context document(s) to evaluate
Faithfulness
Evaluates if the generated answer is faithful to the given context.
Example:
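For example (a sketch; the values are illustrative):
```python
from autoevals.ragas import Faithfulness

result = Faithfulness()(
    input="What is the capital of Australia?",
    output="The capital of Australia is Canberra.",
    context=["Canberra is the capital city of Australia."],
)
print(result.score)  # closer to 1 when every claim is supported by the context
```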
Arguments:
- input: The question being evaluated
- output: The generated answer to evaluate
- context: The context document(s) to evaluate against
AnswerRelevancy
Evaluates how relevant the generated answer is to the input question.
Example:
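For example (a sketch; the strictness value and texts are illustrative):
```python
from autoevals.ragas import AnswerRelevancy

scorer = AnswerRelevancy(strictness=0.7)  # optional tuning parameter
result = scorer(
    input="How long is the Great Wall of China?",
    output="The Great Wall is roughly 21,000 kilometers long.",
    context=["The Great Wall of China measures about 21,196 km."],
)
print(result.score)
```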
Arguments:
- input: The question being evaluated
- output: The generated answer to evaluate
- context: The context document(s) to evaluate against
- strictness: Optional float between 0-1, higher values enforce stricter relevancy
- temperature: Optional float between 0-1, lower values make evaluation more deterministic
- embedding_model: Optional model to use for embeddings, defaults to text-embedding-3-small
AnswerSimilarity
Evaluates how semantically similar the generated answer is to the expected answer.
Example:
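For example (a sketch; the values are illustrative):
```python
from autoevals.ragas import AnswerSimilarity

result = AnswerSimilarity()(
    output="The meeting was postponed until next week.",
    expected="The meeting has been moved to next week.",
)
print(result.score)
```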
Arguments:
- output: The generated answer to evaluate
- expected: The expected/ground truth answer
- embedding_model: Optional model to use for embeddings, defaults to text-embedding-3-small
AnswerCorrectness
Evaluates how correct the generated answer is compared to the expected answer.
Example:
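For example (a sketch; the weights and texts are illustrative):
```python
from autoevals.ragas import AnswerCorrectness

scorer = AnswerCorrectness(factuality_weight=0.75, answer_similarity_weight=0.25)
result = scorer(
    input="When did World War II end?",
    output="World War II ended in 1945.",
    expected="It ended in 1945.",
)
print(result.score)
```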
Arguments:
- input: The question being evaluated
- output: The generated answer to evaluate
- expected: The expected/ground truth answer
- factuality_weight: Optional float between 0-1 for factual correctness weight
- answer_similarity_weight: Optional float between 0-1 for answer similarity weight
- answer_similarity: Optional AnswerSimilarity instance for similarity evaluation
autoevals.moderation
Moderation
A scorer that evaluates if AI responses contain inappropriate or unsafe content.
This scorer uses OpenAI's moderation API to check for potentially harmful content in text. It returns a score of 0 if content is flagged as inappropriate, and 1 if it passes moderation.
Arguments:
- threshold: Optional float to set a custom threshold for content flagging. If None, uses OpenAI's default flagging logic. If set, content is flagged if any category score exceeds this threshold.
- client: Optional client for making API calls. Can be:
  - None: Uses global client from init()
  - OpenAI v1 client: Used directly
  - OpenAI v0 module: Wrapped in a client adapter
Example:
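A minimal sketch (the threshold and text are illustrative):
```python
from autoevals.moderation import Moderation

moderation = Moderation(threshold=0.4)  # threshold is optional
result = moderation(output="I want to hurt someone.")
print(result.score)     # 0 if flagged, 1 if it passes moderation
print(result.metadata)  # per-category details from the moderation API
```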
__init__
Initialize a Moderation scorer.
Arguments:
- threshold: Optional float to set a custom threshold for content flagging. If None, uses OpenAI's default flagging logic. If set, content is flagged if any category score exceeds this threshold.
- client: Optional client for making API calls. Can be:
  - None: Uses global client from init()
  - OpenAI v1 client: Used directly
  - OpenAI v0 module: Wrapped in a client adapter
- api_key: Deprecated. Use client instead.
- base_url: Deprecated. Use client instead.
Notes:
The api_key and base_url parameters are deprecated and will be removed in a future version. Instead, you can either:
- Pass a client instance directly to this constructor using the client parameter
- Set a global client using autoevals.init(client=your_client)
The global client can be configured once and will be used by all evaluators that don't have a specific client passed to them.
autoevals.string
String evaluation scorers for comparing text similarity.
This module provides scorers for text comparison:
- Levenshtein: Compare strings using edit distance
  - Fast, local string comparison
  - Suitable for exact matches and small variations
  - No external dependencies
  - Simple to use with just output/expected parameters
- EmbeddingSimilarity: Compare strings using embeddings
  - Semantic similarity using embeddings
  - Requires OpenAI API access
  - Better for comparing meaning rather than exact matches
  - Supports both sync and async evaluation
  - Built-in caching for efficiency
  - Configurable with options for model, prefix, thresholds
Levenshtein
String similarity scorer using edit distance.
Example:
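For example (a sketch):
```python
from autoevals.string import Levenshtein

result = Levenshtein()(output="kitten", expected="sitting")
print(result.score)  # between 0 and 1; 1 means identical strings
```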
Arguments:
- output: String to evaluate
- expected: Reference string to compare against
Returns:
Score object with normalized similarity (0-1), where 1 means identical strings
EmbeddingSimilarity
String similarity scorer using embeddings.
Example:
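For example (a sketch; the prefix, threshold, and texts are illustrative):
```python
from autoevals.string import EmbeddingSimilarity

scorer = EmbeddingSimilarity(prefix="Customer support reply: ", expected_min=0.6)
result = scorer(
    output="Your refund has been processed.",
    expected="We have issued your refund.",
)
print(result.score)
```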
Arguments:
- prefix: Optional text to prepend to inputs for domain context
- model: Embedding model to use (default: text-embedding-ada-002)
- expected_min: Minimum similarity threshold (default: 0.7)
- client: Optional AsyncOpenAI/OpenAI client. If not provided, uses global client from init()
Returns:
Score object with:
- score: Normalized similarity (0-1)
- metadata: Additional comparison details
autoevals.number
Numeric evaluation scorers for comparing numerical values.
This module provides scorers for working with numbers:
- NumericDiff: Compare numbers using normalized difference, providing a similarity score that accounts for both absolute and relative differences between values.
Features:
- Normalized scoring between 0 and 1
- Handles special cases like comparing zeros
- Accounts for magnitude when computing differences
- Suitable for both small and large number comparisons
NumericDiff
Numeric similarity scorer using normalized difference.
Example:
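For example (a sketch):
```python
from autoevals.number import NumericDiff

result = NumericDiff()(output=104.0, expected=100.0)
print(result.score)  # close to 1 when the difference is small relative to magnitude
```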
Arguments:
- output: Number to evaluate
- expected: Reference number to compare against
Returns:
Score object with normalized similarity (0-1), where:
- 1 means identical numbers
- Score decreases as difference increases relative to magnitude
- Special case: score=1 when both numbers are 0
autoevals.json
JSON evaluation scorers for comparing and validating JSON data.
This module provides scorers for working with JSON data:
- JSONDiff: Compare JSON objects for structural and content similarity
  - Handles nested structures, strings, numbers
  - Customizable with different scorers for string and number comparisons
  - Can automatically parse JSON strings
- ValidJSON: Validate if a string is valid JSON and matches an optional schema
  - Validates JSON syntax
  - Optional JSON Schema validation
  - Works with both strings and parsed objects
JSONDiff
Compare JSON objects for structural and content similarity.
This scorer recursively compares JSON objects, handling:
- Nested dictionaries and lists
- String similarity using Levenshtein distance
- Numeric value comparison
- Automatic parsing of JSON strings
Example:
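For example (a sketch; the values are illustrative):
```python
from autoevals.json import JSONDiff

result = JSONDiff()(
    output={"name": "Jon Doe", "age": 30},
    expected={"name": "John Doe", "age": 31},
)
print(result.score)  # partial credit for nearly matching fields
```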
Arguments:
- string_scorer: Optional custom scorer for string comparisons (default: Levenshtein)
- number_scorer: Optional custom scorer for number comparisons (default: NumericDiff)
- preserve_strings: Don't attempt to parse strings as JSON (default: False)
Returns:
Score object with:
- score: Similarity score between 0-1
- metadata: Detailed comparison breakdown
ValidJSON
Validate if a string is valid JSON and optionally matches a schema.
This scorer checks if:
- The input can be parsed as valid JSON
- The parsed JSON matches an optional JSON Schema
- Handles both string inputs and pre-parsed JSON objects
Example:
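For example (a sketch; the schema is illustrative):
```python
from autoevals.json import ValidJSON

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
result = ValidJSON(schema=schema)(output='{"name": "Ada"}')
print(result.score)  # 1 if the output is valid JSON that matches the schema
```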
Arguments:
- schema: Optional JSON Schema to validate against
Returns:
Score object with:
- score: 1 if valid JSON (and matches schema if provided), 0 otherwise
- metadata: Validation details or error messages
autoevals.value
Value comparison utilities for exact matching and normalization.
This module provides tools for exact value comparison with smart handling of different data types:
- ExactMatch: A scorer for exact value comparison
  - Handles primitive types (strings, numbers, etc.)
  - Smart JSON serialization for objects and arrays
  - Normalizes JSON strings for consistent comparison
Example:
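For example (a sketch; the values are illustrative, and the second call assumes the JSON normalization described below makes the dict and the JSON string compare equal):
```python
from autoevals.value import ExactMatch

print(ExactMatch()(output="42", expected="42").score)            # 1.0
print(ExactMatch()(output={"a": 1}, expected='{"a": 1}').score)  # 1.0 after JSON normalization
```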
ExactMatch
A scorer that tests for exact equality between values.
This scorer handles various input types:
- Primitive values (strings, numbers, etc.)
- JSON objects (dicts) and arrays (lists)
- JSON strings that can be parsed into objects/arrays
The comparison process:
- Detects if either value is/might be a JSON object/array
- Normalizes both values (serialization if needed)
- Performs exact string comparison
Arguments:
- output: Value to evaluate
- expected: Reference value to compare against
Returns:
Score object with:
- score: 1.0 for exact match, 0.0 otherwise