
Launch an eval

Launch an evaluation. This is the API equivalent of the Eval function built into the Braintrust SDK. In the Eval API, you provide pointers to a dataset, a task function, and scoring functions. The API runs the evaluation, creates an experiment, and returns the results along with a link to the experiment. To learn more about evals, see the Evals guide.

POST /v1/eval
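
At minimum, the request body names a project, a data source, the task function, and at least one scorer. A hedged TypeScript sketch of such a body (all IDs are placeholders; the full field reference and request samples follow):

// Minimal eval launch body; placeholder IDs, see the field reference below.
const body = {
  project_id: "<project-uuid>",
  data: { dataset_id: "<dataset-uuid>" },
  task: { function_id: "<task-function-uuid>" },
  scores: [{ function_id: "<scorer-function-uuid>" }],
};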

Authorization

Authorization
Required
Bearer <token>
In: header

Most Braintrust endpoints are authenticated by providing your API key as an Authorization: Bearer [api_key] header in your HTTP request. You can create an API key on the Braintrust organization settings page.

Request Body

application/json
Required

Eval launch parameters

project_id
Required
string

Unique identifier for the project to run the eval in

data
Required
Any properties in dataset_id, project_dataset_name, dataset_rows

The dataset to use
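
A few possible shapes for data, sketched as TypeScript object literals. Only the dataset_id field is confirmed by the request sample on this page; the field names shown for the project_dataset_name and dataset_rows variants are assumptions inferred from the variant names.

// Reference an existing dataset by ID (matches the request sample below).
const dataById = { dataset_id: "<dataset-uuid>" };

// Reference a dataset by project and name (assumed field names for the
// project_dataset_name variant).
const dataByName = { project_name: "my-project", dataset_name: "golden-questions" };

// Provide rows inline (assumed field names for the dataset_rows variant).
const dataInline = {
  data: [{ input: "What is 1 + 1?", expected: "2" }],
};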

task
Required
Any properties in function_id, project_slug, global_function, prompt_session_id, inline_code, inline_prompt

The function to evaluate
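
The request sample on this page uses the function_id variant, optionally pinned to a version. A sketch of that shape, plus an assumed shape for the global_function variant (a built-in function referenced by name):

// A specific function by ID, optionally pinned to a version
// (matches the request sample below).
const taskById = { function_id: "<task-function-uuid>", version: "<version-id>" };

// Assumed shape for the global_function variant: a built-in function by name.
const taskGlobal = { global_function: "<global-function-name>" };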

scores
Required
array<Any properties in function_id, project_slug, global_function, prompt_session_id, inline_code, inline_prompt>

The functions to score the eval on
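
scores accepts an array of function references in the same shapes as task. A hedged sketch mixing a project-scoped scorer with a built-in one (the global_function shape and the scorer name are assumptions, not confirmed by this page):

const scores = [
  { function_id: "<scorer-function-uuid>" },   // a scorer function you own
  { global_function: "Factuality" },           // assumed: a built-in autoeval scorer by name
];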

experiment_name
string

An optional name for the experiment created by this eval. If it conflicts with an existing experiment, it will be suffixed with a unique identifier.

metadata
object

Optional experiment-level metadata to store about the evaluation. You can later use this to slice & dice across experiments.

parent
Any properties in span_parent_struct, string

Options for tracing the evaluation

stream
boolean

Whether to stream the results of the eval. If true, the request will return two events: one to indicate the experiment has started, and another upon completion. If false, the request will return the evaluation's summary upon completion.

trial_count
number

The number of times to run the evaluator per input. This is useful for evaluating applications that have non-deterministic behavior and gives you both a stronger aggregate measure and a sense of the variance in the results.

is_public
boolean

Whether the experiment should be public. Defaults to false.

timeout
number

The maximum duration, in milliseconds, to run the evaluation. Defaults to undefined, in which case there is no timeout.

max_concurrency
number

The maximum number of tasks/scorers that will be run concurrently. Defaults to 10. If null is provided, no max concurrency will be used.

base_experiment_name
string

An optional experiment name to use as a base. If specified, the new experiment will be summarized and compared to this experiment.

base_experiment_id
string

An optional experiment id to use as a base. If specified, the new experiment will be summarized and compared to this experiment.

git_metadata_settings
object

Optional settings for collecting git metadata. By default, will collect all git metadata fields allowed in org-level settings.

repo_info
object

strict
boolean

If true, throw an error if one of the variables in the prompt is not present in the input

stop_token
string

The token to stop the run

curl -X POST "https://api.braintrust.dev/v1/eval" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "string",
    "data": {
      "dataset_id": "string",
      "_internal_btql": {
        "property1": null,
        "property2": null
      }
    },
    "task": {
      "function_id": "string",
      "version": "string"
    },
    "scores": [
      {
        "function_id": "string",
        "version": "string"
      }
    ],
    "experiment_name": "string",
    "metadata": {
      "property1": null,
      "property2": null
    },
    "parent": {
      "object_type": "project_logs",
      "object_id": "string",
      "row_ids": {
        "id": "string",
        "span_id": "string",
        "root_span_id": "string"
      },
      "propagated_event": {
        "property1": null,
        "property2": null
      }
    },
    "stream": true,
    "trial_count": 0,
    "is_public": true,
    "timeout": 0,
    "max_concurrency": 0,
    "base_experiment_name": "string",
    "base_experiment_id": "string",
    "git_metadata_settings": {
      "collect": "all",
      "fields": [
        "commit"
      ]
    },
    "repo_info": {
      "commit": "string",
      "branch": "string",
      "tag": "string",
      "dirty": true,
      "author_name": "string",
      "author_email": "string",
      "commit_message": "string",
      "commit_time": "string",
      "git_diff": "string"
    },
    "strict": true,
    "stop_token": "string"
  }'
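
The same request issued from TypeScript with fetch. This is a minimal sketch, not an official client: it assumes an existing project, dataset, task function, and scorer (all IDs are placeholders), reads the API key from an assumed BRAINTRUST_API_KEY environment variable, and leaves stream off so the call resolves with the eval summary documented below.

// Minimal sketch: launch an eval and wait for the summary.
async function launchEval() {
  const res = await fetch("https://api.braintrust.dev/v1/eval", {
    method: "POST",
    headers: {
      // Assumes your API key is in the BRAINTRUST_API_KEY environment variable.
      Authorization: `Bearer ${process.env.BRAINTRUST_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      project_id: "<project-uuid>",
      data: { dataset_id: "<dataset-uuid>" },
      task: { function_id: "<task-function-uuid>" },
      scores: [{ function_id: "<scorer-function-uuid>" }],
      experiment_name: "api-launched-eval",
      trial_count: 1,
    }),
  });
  if (!res.ok) {
    throw new Error(`Eval launch failed: ${res.status} ${await res.text()}`);
  }
  return res.json(); // the eval launch response documented below
}

launchEval().then((summary) => console.log(summary.experiment_url));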

Eval launch response

{
  "project_name": "string",
  "experiment_name": "string",
  "project_url": "http://example.com",
  "experiment_url": "http://example.com",
  "comparison_experiment_name": "string",
  "scores": {
    "property1": {
      "name": "string",
      "score": 1,
      "diff": -1,
      "improvements": 0,
      "regressions": 0
    },
    "property2": {
      "name": "string",
      "score": 1,
      "diff": -1,
      "improvements": 0,
      "regressions": 0
    }
  },
  "metrics": {
    "property1": {
      "name": "string",
      "metric": 0,
      "unit": "string",
      "diff": 0,
      "improvements": 0,
      "regressions": 0
    },
    "property2": {
      "name": "string",
      "metric": 0,
      "unit": "string",
      "diff": 0,
      "improvements": 0,
      "regressions": 0
    }
  }
}
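
For readers consuming this from TypeScript, a hedged shape for the summary, derived from the example above. Which fields are optional is an assumption; diff and comparison_experiment_name presumably appear only when the new experiment is compared against a base experiment.

interface ScoreSummary {
  name: string;
  score: number;           // average score for this scorer
  diff?: number;           // change vs. the comparison experiment, if any
  improvements: number;
  regressions: number;
}

interface MetricSummary {
  name: string;
  metric: number;
  unit: string;
  diff?: number;
  improvements: number;
  regressions: number;
}

interface EvalLaunchResponse {
  project_name: string;
  experiment_name: string;
  project_url: string;
  experiment_url: string;
  comparison_experiment_name?: string;
  scores: Record<string, ScoreSummary>;
  metrics: Record<string, MetricSummary>;
}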
