file: ./content/docs/changelog.mdx meta: { "title": "Changelog" } # Changelog ## Week of 2025-02-24 * Add support for removing all permissions for a group/user on an object with a single click. * Add support for Claude 3.7 Sonnet model. * Add [llms.txt](/docs/llms.txt) for docs content. * Enable spellcheck for prompt message editors. * Add support for Anthropic Claude models in Vertex AI. * Add support for Claude 3.7 Sonnet in Bedrock and Vertex AI. * Add support for Perplexity R1 1776, Mistral Saba, Gemini LearnLM, and more Groq models. * Support system instructions in Gemini models. * Add support for Gemini 2.0 Flash-Lite, and remove preview model, which no longer serves requests. * Add support for default Bedrock cross-region inference profiles in the playground and AI proxy. * Move score distribution charts to the experiment sidebar. * Add support for OpenAI GPT-4.5 model in the playground and AI proxy. ### API (version 0.0.63) * Support for Claude 3.7 Sonnet, Gemini 2.0 Flash-Lite, and several other models in the proxy. * Stability and performance improvements for ETL processes. * A new `/status` endpoint to check the health of Braintrust services. ### SDK (version 0.0.185) \[upcoming] * Added support for handling score values when an Eval has errored. ## Week of 2025-02-17 * Add support for stop sequences in Anthropic, Bedrock, and Google models. * Resolve JSON Schema references when translating structured outputs to Gemini format. * Add button to copy table cell contents to clipboard. * Add support for basic Cache-Control headers in the AI proxy. * Add support for selecting all or none in the categories of permission dialogs. * Respect Bedrock providers not supporting streaming in the AI proxy. ### SDK (version 0.0.185) \[upcoming] * Improve support for binary packages in `npx braintrust eval`. * Support templated structured outputs. * Fix dataset summary types in Typescript. ## Week of 2025-02-10 * Store table grouping, row height, and layout options in the view configuration. * Add the ability to set a default table view. * Add support for Google Cloud Vertex AI in the playground and proxy. Google Cloud auth is supported for principals and service accounts via either OAuth 2.0 token or service account key. * Add default cloud providers section to the organization AI providers page. * Support streaming responses from OpenAI o1 models in the playground and AI proxy. ## Week of 2025-02-03 * Add complete support for Bedrock models in the playground and AI proxy; this includes support for system prompts, tool calls, and multimodal inputs. * Fix model provider configuration issues in which custom models could clobber default models, and different providers of the same type could clobber each other. * Fix bug in streaming JSON responses from non-OpenAI providers. * Supported templated structured outputs in experiments run from the playground. * Support structured outputs in the playground and AI proxy for Anthropic models, Bedrock models, and any OpenAI-flavored models that support tool calls, e.g. LLaMa on Together.ai. * Support templated custom headers for custom AI providers. See the [proxy docs](/docs/guides/proxy#custom-models) for more details. * Added and updated models across all providers in the playground and AI proxy. * Support tool usage and structured outputs for Gemini models in the playground and AI proxy. * Simplify playground model dropdown by showing model variations in a nested dropdown. ## Week of 2025-01-27 * Add support for duplicating prompts, scorers, and tools. 
* Fix pagination for the `/v1/prompt` REST API endpoint.
* Add an "Unreviewed" default view on experiment and logs tables to filter out rows that have been human reviewed.
* Add o3-mini to the AI proxy and playground.
* Scorer dropdown now supports using custom scoring functions across projects.

### SDK Integrations: LangChain.js (version 0.0.5)

* Less noisy logging from the LangChain.js integration.
* You can now pass a `NOOP_SPAN` to the `BraintrustCallbackHandler` to disable logging.
* Fix a bug where the LangChain.js integration could not handle null/undefined values in chain inputs/outputs.

### SDK (version 0.0.184)

* `span.export()` will no longer throw if Braintrust is down.
* Improved Python prompt rendering to correctly render formatted messages, LLM tool calls, and other structured outputs.

## Week of 2025-01-20

* Drag and drop to reorder span fields in experiment/log traces and dataset rows. On wider screens, fields can also be arranged side-by-side.
* Small convenience improvement to the BTQL Sandbox to avoid having to include `filter:` in an advanced filter clause.
* Add an attachments browser to view all attachments for a span in a sidebar. To open the attachments browser, expand the trace and click the arrow icon in the attachments section. It will only be visible when the trace panel is wide enough.

![Attachments browser](./reference/release-notes/open-attachments-browser.png)

### SDK (version 0.0.183)

* Fix a bug related to `initDataset()` in the TypeScript SDK creating links in `Eval()` calls.
* Fix a few type checking issues in the Python SDK.

## Week of 2025-01-13

* Add support for setting a baseline experiment for experiment comparisons. If a baseline experiment is set, it will be chosen by default as the comparison when clicking on an experiment.
* UI updates to experiment and log tables.
* Trace audit log now displays granular changes to span data.
* Start/end columns shown as dates/times.
* Non-existent trace records display an error message instead of loading indefinitely.

### SDK Integrations: LangChain.js (version 0.0.4)

* Support logging spans from inside evals in the LangChain.js integration.

### SDK (version 0.0.182)

* Improved logging for moderation models from the SDK wrappers.

## Week of 2025-01-06

* Creating an experiment from a playground now correctly renders prompts with `input`, `metadata`, `expected`, and `output` mapped fields.
* Fix a small bug where `input.output` data could pollute the dataset's `output` when rendering the prompts.
* The [AI proxy](/docs/guides/proxy) now includes `x-bt-used-endpoint` as a response header. It specifies which of your configured AI providers was used to complete the request.
* Add support for deeplinking to comments within spans, allowing users to easily copy and share links to comments.
* In Human Review mode, display all scores in a form.
* Experiment table rows can now be sorted based on score changes and regressions for each group, relative to a selected comparison experiment.
* The OTEL endpoint now converts attributes under the `braintrust` namespace directly to the corresponding Braintrust fields. For example, `braintrust.input` will appear as `input` in Braintrust. See the [tracing guide](/docs/guides/tracing/integrations#manual-tracing) for more details.
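For example, a minimal sketch of emitting such a span with the OpenTelemetry JavaScript API, assuming an OTLP exporter is already configured to send traces to Braintrust's OTEL endpoint:

```typescript #skip-compile
import { trace } from "@opentelemetry/api";

// Assumes an OTLP trace exporter is already set up to point at the
// Braintrust OTEL endpoint (see the tracing guide linked above).
const tracer = trace.getTracer("my-app");

tracer.startActiveSpan("llm-call", (span) => {
  // Attributes under the `braintrust` namespace map directly onto
  // Braintrust fields, so these appear as `input` and `output` on the span.
  span.setAttribute("braintrust.input", "What is the capital of France?");
  span.setAttribute("braintrust.output", "Paris");
  span.end();
});
```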
* New OTEL attributes that accept JSON-serialized values have been added for convenience: * `gen_ai.prompt_json` * `gen_ai.completion_json` * `braintrust.input_json` * `braintrust.output_json` For more details, see the [tracing guide](/docs/guides/tracing/integrations#manual-tracing). * Experiment tables and individual traces now support comparing trial data between experiments. ### SDK (version 0.0.181) * Add `ReadonlyAttachment.metadata` helper method to fetch a signed URL for downloading the attachment metadata. ### SDK (version 0.0.179) * New `hook.expected` for reading and updating expected values in the Eval framework. * Small type improvements for `hook` objects. * Fixed a bug to enable support for `init_function` with LLM scorers in Python. * Support nested attachments in Python. ## Week of 2024-12-30 * Add support for free-form human review scores (written to the `metadata` field). ### SDK (version 0.0.179) (unreleased) * Add support for imports in Python functions pushed to Braintrust via `braintrust push`. ### SDK (version 0.0.178) * Cache prompts locally in a two-layered memory/disk cache, and attempt to use this cache if the prompt cannot be fetched from the Braintrust server. * Support for using custom functions that are stored in Braintrust in evals. See the [docs](/docs/guides/evals/write#using-custom-promptsfunctions-from-braintrust) for more details. * Add support for running traced functions in a `ThreadPoolExecutor` in the Python SDK. See the [customize traces guide](/docs/guides/traces/customize) for more information. * Improved formatting of spans logged from the Vercel AI SDK's `generateObject` method. The logged output now matches the format of OpenAI's structured outputs. * Default to `asyncFlush: true` in the TypeScript SDK. This is usually safe since Vercel and Cloudflare both have `waitUntil`, and async flushes mean that clients will not be blocked if Braintrust is down. ### SDK integrations: LangChain.js (version 0.0.2) * Add support for initializing global LangChain callback handler to avoid manually passing the handler to each LangChain object. ## Week of 2024-12-16 ### API (version 0.0.61) * Upgraded to Node.js 22 in Docker containers. ### SDK (version 0.0.177) * Support for creating and pushing custom scorers from your codebase with `braintrust push`. Read the guides to [scorers](/docs/guides/functions/scorers) for more information. ## Week of 2024-12-09 * Add support for structured outputs in the playground. ![Structured outputs](./reference/release-notes/structured-outputs.gif) * Sparkline charts added to the project home page. * Better handling of missing data points in monitor charts. * Clicking on monitor charts now opens a link to traces filtered to the selected time range. * Add `Endpoint supports streaming` flag to custom provider configuration. The [AI proxy](/docs/guides/proxy) will convert non-streaming endpoints to streaming format, allowing the provider's models to be used in the playground. * Experiments chart can be resized vertically by dragging the bottom of the chart. * BTQL sandbox to explore project data using [Braintrust Query Language](/docs/reference/btql). * Add support for updating span data from custom span iframes. ### Autoevals (version 0.0.110) * Python Autoevals now support custom clients when calling evaluators. See [docs](https://pypi.org/project/autoevals/) for more details. ### SDK (version 0.0.176) * New `hook.metadata` for reading and updating Eval metadata when using the `Eval` framework. 
The previous `hook.meta` is now deprecated.

### SDK integrations: LangChain.js (version 0.0.1)

* New LangChain.js integration to export traces from `langchainjs` runs.

## Week of 2024-12-02

* Significantly speed up loading performance for experiments and logs, especially with lots of spans. This speedup comes with a few changes in behavior:
  * Searches inside experiments will only work over content in the tabular view, rather than over the full trace.
  * While searching on the logs page, realtime updates are disabled.
* Starring rows in experiment and dataset tables is now supported.
* "Order by regression" option in experiment column menu can now be toggled on and off without losing the previous order.
* Add expanded timeline view for traces.
* Added a 'Request count' chart to the monitor page.
* Add headers to custom provider configuration which the [AI proxy](/docs/guides/proxy) will include in the request to the custom endpoint.
* The logs viewer now supports exporting the currently loaded rows as a CSV or JSON file.

### API (version 0.0.60)

* Make `PG_URL` configuration more uniform between the Node.js and Python clients.

### SDK (version 0.0.175)

* Fix a bug with serializing `ReadonlyAttachment` in logs.

## Week of 2024-11-25

* Experiment columns can now be reordered from the column menu.
* You can now customize legends in monitor charts. Select a legend item to highlight its data, Shift (⇧) + Click to select multiple items, or Command (⌘) / Ctrl (⌃) + Click to deselect.

### SDK (version 0.0.174)

* AI SDK fixes: support for image URLs and properly formatted tool calls so "Try prompt" works in the UI.

### SDK (version 0.0.173)

* Attachments can now be loaded when iterating an experiment or dataset.

### SDK (version 0.0.172)

* Fix a bug where `braintrust eval` did not respect certain configuration options, like `base_experiment_id`.
* Fix a bug where `invoke` in the Python SDK did not properly stream responses.

## Week of 2024-11-18

* The Traceloop OTEL integration now uses the input and output attributes to populate the corresponding fields in Braintrust.
* The monitor page now supports querying experiment metrics.
* Removed the `filters` param from the REST API fetch endpoint. For complex queries, we recommend using the `/btql` endpoint ([docs](/docs/reference/btql)).
* New experiment summary layout option, a URL-friendly view for experiment summaries that respects all filters.
* Add a default limit of 10 to all fetch and `/btql` requests for project\_logs.
* You can now export your prompts from the playground as code snippets and run them through the [AI proxy](/docs/guides/proxy).
* Add a fallback for the "add prompt" dropdown button in the playground, which will search for prompts within the current project if the cross-org prompts query fails.

### SDK (version 0.0.171)

* Add a `.data` method to the `Attachment` class, which lets you inspect the loaded attachment data.

## Week of 2024-11-12

* Support for creating and pushing custom Python tools and prompts from your codebase with `braintrust push`. Read the guides to [tools](/docs/guides/functions/tools) and [prompts](/docs/guides/functions/prompts) for more information.
* You can now view grouped summary data for all experiments by selecting **Include comparisons in group** from the **Group by** dropdown inside an experiment.
* The experiments page now supports downloading as CSV/JSON.
* Downloading or duplicating a dataset in the UI now properly copies all dataset rows.
* You can now view score data as a bar chart for your experiment data by selecting **Score comparison** from the X axis selector.
* Trials information is now shown as a separate column in diff mode in the experiment table.
* Cmd/Ctrl + S hotkey to save prompts in the playground and function dialogs.

### SDK (version 0.0.170)

* Support uploading [file attachments in the Python SDK](/docs/reference/libs/python#attachment-objects).
* Log, feedback, and dataset inputs to the Python SDK are now synchronously deep-copied for more consistent logging.

### SDK (version 0.0.169)

* The Python SDK `Eval()` function has been split into `Eval()` and `EvalAsync()` to make it clear which one should be called in an asynchronous context. The behavior of `Eval()` remains unchanged. However, `Eval()` callers running in an asynchronous context are strongly recommended to switch to `EvalAsync()` to improve type safety.
* Improved type annotations in the Python SDK.

### SDK (version 0.0.168)

* A new `Span.permalink()` method allows you to format a permalink for the current span. See [TypeScript docs](/docs/reference/libs/nodejs/interfaces/Span#permalink) or [Python docs](/docs/reference/libs/python#permalink) for details.
* `braintrust push` support for Python tools and prompts.

## Week of 2024-11-04

* The Braintrust [AI Proxy](/docs/guides/proxy) now supports the [OpenAI Realtime API](https://platform.openai.com/docs/guides/realtime), providing observability for voice-to-voice model sessions and simplifying backend infrastructure.
* Add "Group by" functionality to the monitor page.
* The experiment table can now be visualized in a [grid layout](/docs/guides/evals/interpret#grid-layout), where each column represents an experiment to compare long-form outputs side-by-side.
* 'Select all' button in permission dialogs.
* Create custom columns on dataset, experiment, and logs tables from `JSON` values in `input`, `output`, `expected`, or `metadata` fields.

### API (version 0.0.59)

* Fix a permissions bug with updating org-scoped env vars.

## Week of 2024-10-28

* The Braintrust [AI Proxy](/docs/guides/proxy) can now [issue temporary credentials](/docs/guides/proxy#api-key-management) to access the proxy for a limited time. This can be used to make AI requests directly from frontends and mobile apps, minimizing latency without exposing your API keys.
* Move experiment score summaries to the table column headers. To view improvements and regressions per metadata or input group, first group the table by the relevant field. Sooo much room for \[table] activities!
* You now receive a clear error message if you run out of free tier capacity while running an experiment from the playground.
* Filters on JSON fields now support array indexing, e.g. `metadata.foo[0] = 'bar'`. See [docs](/docs/reference/btql#Expressions).

### SDK (version 0.0.168)

* `initDataset()`/`init_dataset()` used in `Eval()` now tracks the dataset ID and links to each row in the dataset properly.

## Week of 2024-10-21

* Preview [file attachments](/docs/guides/tracing#uploading-attachments) in the trace view.
* View and filter by comments in the experiment table.
* Add table row numbers to experiments, logs, and datasets.

### SDK (version 0.0.167)

* Support uploading [file attachments in the TypeScript SDK](/docs/reference/libs/nodejs/classes/Attachment).
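A rough sketch of what logging an attachment from TypeScript can look like; the exact `Attachment` constructor options are an assumption here, so see the linked reference for the precise signature:

```typescript #skip-compile
import { readFileSync } from "fs";
import { Attachment, initLogger } from "braintrust";

const logger = initLogger({ projectName: "My Project" });

logger.log({
  input: {
    question: "What is in this image?",
    // The data/filename/contentType options shown here are illustrative;
    // check the Attachment reference for the exact constructor signature.
    image: new Attachment({
      data: readFileSync("photo.png"),
      filename: "photo.png",
      contentType: "image/png",
    }),
  },
  output: "A cat sitting on a windowsill.",
});
```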
* Log, feedback, and dataset inputs to the TypeScript SDK are now synchronously deep-copied for more consistent logging. * Address an issue where the TypeScript SDK could not make connections when running in a Cloudflare Worker. ### API (version 0.0.59) * Support uploading [file attachments](/docs/reference/libs/nodejs/classes/Attachment). * You can now export [OpenTelemetry (OTel)](https://opentelemetry.io/docs/specs/otel/) traces to Braintrust. See the [tracing guide](/docs/guides/tracing/integrations#opentelemetry-otel) for more details. ## Week of 2024-10-14 * The Monitor page now shows an aggregate view of log scores over time. * Improvement/Regression filters between experiments are now saved to the URL. * Add `max_concurrency` and `trial_count` to the playground when kicking off evals. `max_concurrency` is useful to avoid hitting LLM rate limits, and `trial_count` is useful for evaluating applications that have non-deterministic behavior. * Show a button to scroll to a single search result in a span field when using trace search. * Indicate spans with errors in the trace span list. ### SDK (version 0.0.166) * Allow explicitly specifying git metadata info in the Eval framework. ### SDK (version 0.0.165) * Support specifying dataset-level metadata in `initDataset/init_dataset`. ### SDK (version 0.0.164) * Add `braintrust.permalink` function to create deep links pointing to particular spans in the Braintrust UI. ## Week of 2024-10-07 * After using "Copy to Dataset" to create a new dataset row, the audit log of the new row now links back to the original experiment, log, or other dataset. * Tools now stream their `stdout` and `stderr` to the UI. This is helpful for debugging. * Fix prompt, scorer, and tool dropdowns to only show the correct function types. ### SDK (version 0.0.163) * Fix Python SDK compatibility with Python 3.8. ### SDK (version 0.0.162) * Fix Python SDK compatibility with Python 3.9 and older. ### SDK (version 0.0.161) * Add utility function `spanComponentsToObjectId` for resolving the object ID from an exported span slug. ## Week of 2024-09-30 * The [Github action](/docs/guides/evals/run#github-action) now supports Python runtimes. * Add support for [Cerebras](https://cerebras.ai/) models in the proxy, playground, and saved prompts. * You can now create [span iframe viewers](/docs/guides/tracing#custom-span-iframes) to visualize span data in a custom iframe. In this example, the "Table" section is a custom span iframe. ![Span iframe](./guides/traces/span-iframe.png) * `NOT LIKE`, `NOT ILIKE`, `NOT INCLUDES`, and `NOT CONTAINS` supported in BTQL. * Add "Upload Rows" button to insert rows into an existing dataset from CSV or JSON. * Add "Maximum" aggregate score type. * The experiment table now supports grouping by input (for trials) or by a metadata field. * The Name and Input columns are now pinned * Gemini models now support multimodal inputs. ## Week of 2024-09-23 * Basic monitor page that shows aggregate values for latency, token count, time to first token, and cost for logs. * Create custom tools to use in your prompts and in the playground. See the [docs](/docs/guides/prompts#calling-external-tools) for more details. * Set org-wide environment variables to use in these tools * Pull your prompts to your codebase using the `braintrust pull` command. * Select and compare multiple experiments in the experiment view using the `compared with` dropdown. * The playground now displays aggregate scores (avg/max/min) for each prompt and supports sorting rows by a score. 
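For the custom tools mentioned in the Week of 2024-09-23 notes above, here is a rough sketch of what defining one in TypeScript might look like. The builder calls shown (`projects.create`, `project.tools.create`) and their options are assumptions based on the tools guide, so check the linked docs for the exact API:

```typescript #skip-compile
import * as braintrust from "braintrust";
import { z } from "zod";

// Hypothetical example: a small calculator tool that could be uploaded with
// `braintrust push` and then called from prompts or the playground.
const project = braintrust.projects.create({ name: "My Project" });

project.tools.create({
  name: "Add",
  slug: "add",
  description: "Add two numbers together",
  parameters: z.object({
    a: z.number(),
    b: z.number(),
  }),
  handler: ({ a, b }) => a + b,
});
```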
* Compare span field values side-by-side in the trace viewer when fullscreen and diff mode is enabled. ### SDK (version 0.0.160) * Fix a bug with `setFetch()` in the TypeScript SDK. ### SDK (version 0.0.159) * In Python, running the CLI with `--verbose` now uses the `INFO` log level, while still printing full stack traces. Pass the flag twice (`-vv`) to use the `DEBUG` log level. * Create and push custom tools from your codebase with `braintrust push`. See [docs](/docs/guides/prompts#calling-external-tools) for more details. TypeScript only for now. * A long awaited feature: you can now pull prompts to your codebase using the `braintrust pull` command. TypeScript only for now. ### API (version 0.0.56) * Hosted tools are now available in the API. * Environment variables are now supported in the API (not yet in the standard REST API). See the [docker compose file](https://github.com/braintrustdata/braintrust-deployment/blob/main/docker/docker-compose.api.yml#L65) for information on how to configure the secret used to encrypt them if you are using Docker. * Automatically backfill `function_data` for prompts created via the API. ## Week of 2024-09-16 * The tag picker now includes tags that were added dynamically via API, in addition to the tags configured for your project. * Added a REST API for managing AI secrets. See [docs](/docs/reference/api/AiSecrets). ### SDK (version 0.0.158) * A dedicated `update` method is now available for datasets. * Fixed a Python-specific error causing experiments to fail initializing when git diff --cached encounters invalid or inaccessible Git repositories. * Token counts have the correct units when printing `ExperimentSummary` objects. * In Python, `MetricSummary.metric` could have an `int` value. The type annotation has been updated. ## Week of 2024-09-09 * You can now create server-side online evaluations for your logs. Online evals support both [autoevals](/docs/reference/autoevals) and [custom scorers](/docs/guides/playground) you define as LLM-as-a-judge, TypeScript, or Python functions. See [docs](/docs/guides/evals/write#online-evaluation) for more details. * New member invitations now support being added to multiple permission groups. * Move datasets and prompts to a new Library navigation tab, and include a list of custom scorers. * Clean up tree view by truncating the root preview and showing a preview of a node only if collapsed. ![Truncated tree view](./reference/release-notes/truncated-tree-view.png) * Automatically save changes to table views. ## Week of 2024-09-02 * You can now upload typescript evals from the command line as functions, and then use them in the playground. * Click a span field line to highlight it and pin it to the URL. * Copilot tab autocomplete for prompts and data in the Braintrust UI. ```bash # This will bundle and upload the task and scorer functions to Braintrust npx braintrust eval --bundle ``` ### API (version 0.0.54) * Support for bundled eval uploads. * The `PATCH` endpoint for prompts now supports updating the `slug` field. ### SDK (version 0.0.157) * Enable the `--bundle` flag for `braintrust eval` in the TypeScript SDK. ## Week of 2024-08-26 * Basic filter UI (no BTQL necessary) * Add to dataset dropdown now supports adding to datasets across projects. * Add REST endpoint for batch-updating ACLs: `/v1/acl/batch_update`. * Cmd/Ctrl click on a table row to open it in a new tab. * Show the last 5 basic filters in the filter editor. * You can now explicitly set and edit prompt slugs. 
### Autoevals (version 0.0.86) * Add support for Azure OpenAI in node. ### SDK (version 0.0.155) * The client wrappers `wrapOpenAI()`/`wrap_openai()` now support [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs). ### API (version 0.0.54) * Don't fail insertion requests if realtime broadcast fails ## Week of 2024-08-19 * Fixed comment deletion. * You can now use `%` in BTQL queries to represent percent values. E.g. `50%` will be interpreted as `0.5`. ### API (version 0.0.54) * Performance optimizations to filters on `scores`, `metrics`, and `created` fields. * Performance optimizations to filter subfields of `metadata` and `span_attributes`. ## Week of 2024-08-12 * You can now create custom LLM and code (TypeScript and Python) evaluators in the playground. * Fullscreen trace toggle * Datasets now accept JSON file uploads * When uploading a CSV/JSON file to a dataset, columns/fields named `input`, `expected`, and `metadata` are now auto-assigned to the corresponding dataset fields * Fix bug in logs/dataset viewer when changing the search params. ### API (version 0.0.53) * The API now supports running custom LLM and code (TypeScript and Python) functions. To enable this in the: * AWS Cloudformation stack: turn on the `EnableQuarantine` parameter * Docker deployment: set the `ALLOW_CODE_FUNCTION_EXECUTION` environment variable to `true` ## Week of 2024-08-05 * Full text search UI for all span contents in a trace * New metrics in the UI and summary API: prompt tokens, completion tokens, total tokens, and LLM duration * These metrics, along with cost, now exclude LLM calls used in autoevals (as of 0.0.85) * Switching organizations via the header navigates to the same-named project in the selected organization * Added `MarkAsyncWrapper` to the Python SDK to allow explicitly marking functions which return awaitable objects as async ### Autoevals (version 0.0.85) * LLM calls used in autoevals are now marked with `span_attributes.purpose = "scorer"` so they can be excluded from metric and cost calculations. ### Autoevals (version 0.0.84) * Fix a bug where `rationale` was incorrectly formatted in Python. * Update the `full` docker deployment configuration to bundle the metadata DB (supabase) inside the main docker compose file. Thus no separate supabase cluster is required. See [docs](/docs/guides/self-hosting/docker#full-configuration) for details. If you are upgrading an existing full deployment, you will likely want to mark the supabase db volumes `external` to continue using your existing data (see comments in the `docker-compose.full.yml` file for more details). ### SDK (version 0.0.151) * `Eval()` can now take a base experiment. Provide either `baseExperimentName`/`base_experiment_name` or `baseExperimentId`/`base_experiment_id`. ## Week of 2024-07-29 * Errors now show up in the trace viewer. * New cookbook recipe on [benchmarking LLM providers](/docs/cookbook/recipes/ProviderBenchmark). * Viewer mode selections will no longer automatically switch to a non-editable view if the field is editable and persist across trace/span changes. * Show `%` in diffs instead of `pp`. * Add rename, delete and copy current project id actions to the project dropdown. * Playgrounds can now be shared publicly. * Duration now reflects the "task" duration not the overall test case duration (which also includes scores). * Duration is now also displayed in the experiment overview table. * Add support for Fireworks and Lepton inference providers. 
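For example, once Fireworks is configured as a provider, its models can be reached through the AI proxy with a standard OpenAI client. A minimal sketch (the model slug below is hypothetical; substitute one your Fireworks account exposes):

```typescript #skip-compile
import OpenAI from "openai";

// Point the OpenAI client at the Braintrust AI proxy. A Braintrust API key
// with provider secrets configured should work here; a provider API key can
// also be used directly.
const client = new OpenAI({
  baseURL: "https://api.braintrust.dev/v1/proxy",
  apiKey: process.env.BRAINTRUST_API_KEY,
});

const completion = await client.chat.completions.create({
  // Hypothetical Fireworks model slug -- use whatever your account exposes.
  model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
  messages: [{ role: "user", content: "Hello from the proxy!" }],
});

console.log(completion.choices[0].message.content);
```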
* "Jump to" menu to quickly navigate between span sections. * Speed up queries involving metadata fields, e.g. `metadata.foo ILIKE '%bar%'`, using the columnstore backend if it is available. * Added `project_id` query param to REST API queries which already accept `project_name`. E.g. [GET experiments](/docs/reference/api/Experiments#list-experiments). * Update to include the latest Mistral models in the proxy/playground. ### SDK (version 0.0.148) * While tracing, if your code errors, the error will be logged to the span. You can also manually log the `error` field through the API or the logging SDK. ### SDK (version 0.0.147) * `project_name` is now `projectName`, etc. in the `invoke(...)` function in TypeScript * `Eval()` return values are printed in a nicer format (e.g. in Notebooks) * [`updateSpan()`/`update_span()`](/docs/guides/tracing#updating-spans) allows you to update a span's fields after it has been created. ## Week of 2024-07-22 * Categorical human review scores can now be re-ordered via Drag-n-Drop. ![Reorder categorical score](./reference/release-notes/category-score-reorder.gif) * Human review row selection is now a free text field, enabling a quick jump to a specific row. ![Human review free text](./reference/release-notes/humanreviewfreetext.png) * Added REST endpoint for managing org membership. See [docs](/docs/reference/api/Organizations#modify-organization-membership). ### API (version 0.0.51) * The proxy is now a first-class citizen in the API service, which simplifies deployment and sets the groundwork for some exciting new features. Here is what you need to know: * The updates are available as of API version 0.0.51. * The proxy is now accessible at `https://api.braintrust.dev/v1/proxy`. You can use this as a base URL in your OpenAI client, instead of `https://braintrustproxy.com/v1`. \[NOTE: The latter is still supported, but will be deprecated in the future.] * If you are self-hosting, the proxy is now bundled into the API service. That means you no longer need to deploy the proxy as a separate service. * If you have deployed through AWS, after updating the Cloudformation, you'll need to grab the "Universal API URL" from the "Outputs" tab. ![Universal URL Cloudformation](./reference/release-notes/universal-url-cloudformation.png) * Then, replace that in your settings page settings page ![Universal API](./reference/release-notes/universal-api.png) * If you have a Docker-based deployment, you can just update your containers. * Once you see the "Universal API" indicator, you can remove the proxy URL from your settings page, if you have it set. ### SDK (version 0.0.146) * Add support for `max_concurrency` in the Python SDK * Hill climbing evals that use a `BaseExperiment` as data will use that as the default base experiment. ## Week of 2024-07-15 * In preparation for auth changes, we are making a series of updates that may affect self-deployed instances: * Preview URLs will now be subdomains of `*.preview.braintrust.dev` instead of `vercel.app`. Please add this domain to your allow list. * To continue viewing preview URLs, you will need to update your stack (to update the allow list to include the new domain pattern). * The data plane may make requests back to `*.preview.braintrust.dev` URLs. This allows you to test previews that include control plane changes. You may need to whitelist traffic from the data plane to `*.preview.braintrust.dev` domains. * Requests will optionally send an additional `x-bt-auth-token` header. You may need to whitelist this header. 
* User impersonation through the `x-bt-impersonate-user` header now accepts either the user's id or email. Previously only user id was accepted. ### Autoevals (version 0.0.80) * New `ExactMatch` scorer for comparing two values for exact equality. ### Autoevals (version 0.0.77) * Officially switch the default model to be `gpt-4o`. Our testing showed that it performed on average 10% more accurately than `gpt-3.5-turbo`! * Support claude models (e.g. claude-3-5-sonnet-20240620). You can use them by simply specifying the `model` param in any LLM based evaluator. * Under the hood, this will use the proxy, so make sure to configure your Anthropic API keys in your settings. ## Week of 2024-07-08 * Human review scores are now sortable from the project configuration page. ![Reorder scores](./reference/release-notes/reorder-human-review-scores.gif) * Streaming support for tool calls in Anthropic models through the proxy and playground. * The playground now supports different "parsing" modes: * `auto`: (same as before) the completion text and the first tool call arguments, if any * `parallel`: the completion text and a list of all tool calls * `raw`: the completion in the OpenAI non-streaming format * `raw_stream`: the completion in the OpenAI streaming format * Cleaned up environment variables in the public [docker deployment](https://github.com/braintrustdata/braintrust-deployment/tree/main/docker). Functionally, nothing has changed. ### Autoevals (version 0.0.76) * New `.partial(...)` syntax to initialize a scorer with partial arguments like `criteria` in `ClosedQA`. * Allow messages to be inserted in the middle of a prompt. ## Week of 2024-07-01 * Table views [can now be saved](/docs/reference/views), persisting the BTQL filters, sorts, and column state. * Add support for the new `window.ai` model into the playground. ![window.ai](./reference/release-notes/window-ai.gif) * Use push history when navigating table rows to allow for back button navigation. * In the experiments list, grouping by a metadata field will group rows in the table as well. * Allow the trace tree panel to be resized. * Port the log summary query to BTQL. This should speed up the query, especially if you have clickhouse configured in your cloud environment. This functionality requires upgrading your data backend to version 0.0.50. ### SDK (version 0.0.140) * New `wrapTraced` function allows you to trace javascript functions in a more ergonomic way. ```typescript #skip-compile import { wrapTraced } from "braintrust"; const foo = wrapTraced(async function foo(input) { const resp = await client.chat.completions.create({ model: "gpt-3.5-turbo", messages: [{ role: "user", content: input }], }); return resp.choices[0].message.content ?? "unknown"; }); ``` ### SDK (version 0.0.138) * The TypeScript SDK's `Eval()` function now takes a `maxConcurrency` parameter, which bounds the number of concurrent tasks that run. * `braintrust install api` now sets up your API and Proxy URL in your environment. * You can now specify a custom `fetch` implementation in the TypeScript SDK. ## Week of 2024-06-24 * Update the experiment progress and experiment score distribution chart layouts * Format table column headers with icons * Move active filters to the table toolbar * Enable RBAC for all users. When inviting a new member, prompt to add that member to an RBAC Permission group. * Use btql to power the datasets list, making it significantly faster if you have multiple large datasets. * Experiments list chart supports click interactions. 
Left-click to select an experiment, right-click to add an annotation.
* Jump into comparison view between two experiments by selecting them in the table and clicking "Compare".

### Deployment

* The proxy service now supports more advanced functionality which requires setting the `PG_URL` and `REDIS_URL` parameters. If you do not set them, the proxy will still run without caching credentials or requests.

## Week of 2024-06-17

* Add support for labeling [expected fields using human review](/docs/guides/human-review#writing-categorical-scores-to-expected-field).
* Create and edit descriptions for datasets.
* Create and edit metadata for prompts.
* Click scores and attributes (tree view only) in the trace view to filter by them.
* Highlight the experiments graph to filter down the set of experiments.
* Add support for new models including Claude 3.5 Sonnet.

## Week of 2024-06-10

* Improved empty state and instructions for custom evaluators in the playground.
* Show query examples when filtering/sorting.
* [Custom comparison keys](/docs/guides/evals/interpret#customizing-the-comparison-key) for experiments.
* New model dropdown in the playground/prompt editor that is organized by provider and model type.

## Week of 2024-06-03

* You can now collapse the trace tree. It's auto-collapsed if you have a single span. ![Collapsible trace tree](./reference/release-notes/trace-tree.png)
* Improvements to the experiment chart, including greyed-out lines for inactive scores and an improved legend.
* Show diffs when you save a new prompt version. ![Prompt diff](./reference/release-notes/save-prompt.png)

## Week of 2024-05-27

* You can now see which users are viewing the same traces as you are in real-time.
* Improve whitespace and presentation of diffs in the trace view.
* Show markdown previews in the score editor.
* Show cost in spans and display the average cost on experiment summaries and diff views.
* Published a new [Text2SQL eval recipe](/docs/cookbook/recipes/Text2SQL-Data).
* Add groups view for RBAC.

## Week of 2024-05-20

* Deprecate the legacy dataset format (`output` in place of `expected`) in a new version of the SDK (0.0.130). For now, data can still be fetched in the legacy format by setting the `useOutput` / `use_output` flag to false when using `initDataset()` / `init_dataset()`. We recommend updating your code to use datasets with `expected` instead of `output` as soon as possible.
* Improve the UX for saving and updating prompts from the playground.
* New hide/show column controls on all tables.
* New [model comparison](/docs/cookbook/recipes/ModelComparison) cookbook recipe.
* Add support for model / metadata comparison on the experiments view.
* New experiment picker dropdown.
* Markdown support in the LLM message viewer.

## Week of 2024-05-13

* Support copying to clipboard from `input`, `output`, etc. views.
* Improve the empty-state experience for datasets.
* New multi-dimensional charts on the experiment page for comparing models and model parameters.
* Support `HTTPS_PROXY`, `HTTP_PROXY`, and `NO_PROXY` environment variables in the API containers.
* Support infinite scroll in the logs viewer and remove dataset size limitations.

## Week of 2024-05-06

* Denser trace view with span durations built in.
* Rework pagination and fix scrolling across multiple pages in the logs viewer.
* Make BTQL the default search method.
* Add support for Bedrock models in the playground and the proxy.
* Add "copy code" buttons throughout the docs.
* Automatically overflow large objects (e.g.
experiments) to S3 for faster loading and better performance. ## Week of 2024-04-29 * Show images in LLM view, adding the ability to display images in the LLM view in the trace viewer. ![Images in playground](./reference/release-notes/326593724-6a33c3f9-6aad-44a8-b978-d1d8245dcc66.png) * Send an invite email when you invite a new user to your organization. * Support selecting/deselecting scores in the experiment view. * Roll out [Braintrust Query Language](/docs/reference/btql) (BTQL) for querying logs and traces. ## Week of 2024-04-22 * Smart relative time labels for dates (`1h ago`, `3d ago`, etc.) * Added double quoted string literals support, e.g., `tags contains "foo"`. * Jump to top button in trace details for easier navigation. * Fix a race condition in distributed tracing, in which subspans could hit the backend before their parent span, resulting in an inaccurate trace structure. As part of this change, we removed the `parent_id` argument from the latest SDK, which was previously deprecated in favor of `parent`. `parent_id` is only able to use the race-condition-prone form of distributed tracing, so we felt it would be best for folks to upgrade any of their usages from `parent_id` to `parent`. Before upgrading your SDK, if you are currently using `parent_id`, you can port over to using `parent` by changing any exported IDs from `span.id` to `span.export()` and then changing any instances of `parent_id=[span_id]` to `parent=[exported_span]`. For example, if you had distributed tracing code like the following: ```javascript #skip-compile import { initLogger } from "braintrust"; const logger = initLogger({ projectName: "My Project", apiKey: process.env.BRAINTRUST_API_KEY, }); export async function POST(req: Request) { return logger.traced(async (span) => { const { body } = req; const result = await someLLMFunction(body); span.log({ input: body, output: result }); return { result, requestId: span.id, }; }); } export async function POSTFeedback(req: Request) { logger.traced( async (span) => { logger.logFeedback({ id: span.id, // Use the newly created span's id, instead of the original request's id comment: req.body.comment, scores: { correctness: req.body.score, }, metadata: { user_id: req.user.id, }, }); }, { parentId: req.body.requestId, name: "feedback", }, ); } ``` ```python from braintrust import init_logger logger = init_logger(project="My Project") def my_route_handler(req): with logger.start_span() as span: body = req.body result = some_llm_function(body) span.log(input=body, output=result) return { "result": result, "request_id": span.id, } def my_feedback_handler(req): with logger.start_span("feedback", parent_id=req.body.request_id) as span: logger.log_feedback( id=span.id, # Use the newly created span's id, instead of the original request's id scores={ "correctness": req.body.score, }, comment=req.body.comment, metadata={ "user_id": req.user.id, }, ) ``` It would now look like this: ```javascript #skip-compile import { initLogger } from "braintrust"; const logger = initLogger({ projectName: "My Project", apiKey: process.env.BRAINTRUST_API_KEY, }); export async function POST(req: Request) { return logger.traced(async (span) => { const { body } = req; const result = await someLLMFunction(body); span.log({ input: body, output: result }); return { result, requestId: span.export(), }; }); } export async function POSTFeedback(req: Request) { logger.traced( async (span) => { logger.logFeedback({ id: span.id, // Use the newly created span's id, instead of the original request's id 
      comment: req.body.comment,
      scores: {
        correctness: req.body.score,
      },
      metadata: {
        user_id: req.user.id,
      },
    });
  },
  {
    parent: req.body.requestId,
    name: "feedback",
  },
  );
}
```

```python
from braintrust import init_logger

logger = init_logger(project="My Project")


def my_route_handler(req):
    with logger.start_span() as span:
        body = req.body
        result = some_llm_function(body)
        span.log(input=body, output=result)
        return {
            "result": result,
            "request_id": span.export(),
        }


def my_feedback_handler(req):
    with logger.start_span("feedback", parent=req.body.request_id) as span:
        logger.log_feedback(
            id=span.id,  # Use the newly created span's id, instead of the original request's id
            scores={
                "correctness": req.body.score,
            },
            comment=req.body.comment,
            metadata={
                "user_id": req.user.id,
            },
        )
```

## Week of 2024-04-15

* Incremental support for role-based access control (RBAC) logic within the API server backend. As part of this change, we removed certain API endpoints which are no longer in use, in particular the `/crud/{object_type}` endpoint. For the handful of usages of these endpoints in old versions of the SDK libraries, we added backwards-compatibility routes, but it is possible we may have missed a few. Please let us know if your code is trying to use an endpoint that no longer exists and we can remediate.
* Changed the semantics of experiment initialization with `update=True`. Previously, we required the experiment to already exist; now we create the experiment if it doesn't already exist and otherwise return the existing one. This change affects the semantics of the `PUT /v1/experiment` operation, so that it will not replace the contents of an existing experiment with a new one, but instead just return the existing one, meaning it behaves the same as `POST /v1/experiment`. Eventually we plan to revise the update semantics for other object types as well. Therefore, we have deprecated the `PUT` endpoint across the board and plan to remove it in a future revision of the API.

## Week of 2024-04-08

* Added support for new multimodal models (`gpt-4-turbo`, `gpt-4-vision-preview`, `gpt-4-1106-vision-preview`, `gpt-4-turbo-2024-04-09`, `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`).
* Introduced [REST API for RBAC](/docs/api/spec#roles) (Role-Based Access Control) objects including CRUD operations on roles, groups, and permissions, and added a read-only API for users.
* Improved AI search and added positive/negative tag filtering in AI search. To positively filter, prefix the tag with `+`, and to negatively filter, prefix the tag with `-`. We are making some systematic changes to the search experience, and the search syntax is subject to change.

## Week of 2024-04-01

* Added functionality for distributed tracing. See the [docs](/docs/guides/tracing#distributed-tracing) for more details. As part of this change, we had to rework the core logging implementation in the SDKs to rely on some newer backend API features. Therefore, if you are hosting Braintrust on-prem, before upgrading your SDK to any version `>= 0.0.115`, make sure your API version is `>= 0.0.35`. You can query the version of the on-prem server with `curl [api-url]/version`, where the API URL can be found on the settings page.

## Week of 2024-03-25

* Introduce multimodal support for OpenAI and Anthropic models in the prompt playground and proxy. You can now pass image URLs, base64-encoded image strings, or mustache template variables to models that support multimodal inputs.
![Multimodal prompt](./reference/release-notes/multimodal-prompt.gif) * The REST API now gzips responses. * You can now return dynamic arrays of scores in `Eval()` functions ([docs](/docs/guides/evals#dynamic-scoring)). * Launched [Reporters](/docs/guides/evals#custom-reporters), a way to summarize and report eval results in a custom format. * New coat of paint in the trace view. * Added support for Clickhouse as an additional storage backend, offering a more scalable solution for handling large datasets and performance improvements for certain query types. You can enable it by setting the `UseManagedClickhouse` parameter to `true` in the CloudFormation template or installing the docker container. * Implemented realtime checks using a WebSocket connection and updated proxy configurations to include CORS support. * Introduced an API version checker tool so you know when your API version is outdated. ## Week of 2024-03-18 * Add new database parameters for external databases in the CloudFormation template. * Faster optimistic updates for large writes in the UI. * "Open in playground" now opens a lighter weight modal instead of the full playground. * Can create a new prompt playground from the prompt viewer. ## Week of 2024-03-11 * Shipped support for [prompt management](/docs/guides/prompts). * Moved playground sessions to be within projects. All existing sessions are now in the "Playground Sessions" project. * Allowed customizing proxy and real-time URLs through the web application, adding flexibility for different deployment scenarios. * Improved documentation for Docker deployments. * Improved folding behavior in data editors. ## Week of 2024-03-04 * Support custom models and endpoint configuration for all providers. * New add team modal with support for multiple users. * New information architecture to enable faster project navigation. * Experiment metadata now visible in the experiments table. * Improve UI write performance with batching. * Log filters now apply to *any* span. * Share button for traces * Images now supported in the tree view (see [tracing docs](/docs/guides/tracing#multimodal-content) for more). ## Week of 2024-02-26 * Show auto scores before manual scores (matching trace) in the table * New logo is live! * Any span can now submit scores, which automatically average in the trace. This makes it easier to label scores in the spans where they originate. * Improve sidebar scrolling behavior. * Add AI search for datasets and logs. * Add tags to the SDK. * Support viewing and updating metadata on the experiment page. ## Week of 2024-02-19 We rolled out a breaking change to the REST API that renames the `output` field to `expected` on dataset records. This change brings the API in line with [last week's update](#week-of-2024-02-12) to the Braintrust SDK. For more information, refer to the REST API docs for dataset records ([insert](/docs/api/spec#insert-dataset-events) and [fetch](/docs/api/spec#fetch-dataset-get-form)). * Add support for [tags](/docs/guides/logging#tags-and-queues). * Score fields are now sorted alphabetically. * Add support for Groq ModuleResolutionKind. * Improve tree viewer and XML parser. * New experiment page redesign ## Week of 2024-02-12 We are rolling out a change to dataset records that renames the `output` field to `expected`. 
If you are using the SDK, datasets will still fetch records using the old format for now, but we recommend future-proofing your code by setting the `useOutput` / `use_output` flag to false when calling `initDataset()` / `init_dataset()`, which will become the default in a future version of Braintrust. When you set `useOutput` to false, your dataset records will contain `expected` instead of `output`. This makes it easy to use them with `Eval(...)` to provide expected outputs for scoring, since you'll no longer have to manually rename `output` to `expected` when passing data to the evaluator: ```typescript import { Eval, initDataset } from "braintrust"; import { Levenshtein } from "autoevals"; Eval("My Eval", { data: initDataset("Existing Dataset", { useOutput: false }), // Records will contain `expected` instead of `output` task: (input) => "foo", scores: [Levenshtein], }); ``` ```python from braintrust import Eval, init_dataset from autoevals import Levenshtein Eval( "My Eval", data=init_dataset("Existing Dataset", use_output=False), # Records will contain `expected` instead of `output` task=lambda input: "foo", scores=[Levenshtein], ) ``` Here's an example of how to insert and fetch dataset records using the new format: ```typescript #skip-compile import { initDataset } from "braintrust"; // Currently `useOutput` defaults to true, but this will change in a future version of Braintrust. const dataset = initDataset("My Dataset", { useOutput: false }); dataset.insert({ input: "foo", expected: { result: 42, error: null }, // Instead of `output` metadata: { model: "gpt-3.5-turbo" }, }); await dataset.flush(); for await (const record of dataset) { console.log(record.expected); // Instead of `record.output` } ``` ```python from braintrust import init_dataset # Currently `use_output` defaults to True, but this will change in a future version of Braintrust. dataset = init_dataset("My Dataset", use_output=False) dataset.insert( input="foo", expected=dict(result=42, error=None), # Instead of `output` metadata=dict(model="gpt-3.5-turbo"), ) dataset.flush() for record in dataset: print(record["expected"]) # Instead of `record["output"]` ``` * Support duplicate `Eval` names. * Fallback to `BRAINTRUST_API_KEY` if `OPENAI_API_KEY` is not set. * Throw an error if you use `experiment.log` and `experiment.start_span` together. * Add keyboard shortcuts (j/k/p/n) for navigation. * Increased tooltip size and delay for better usability. * Support more viewing modes: HTML, Markdown, and Text. ## Week of 2024-02-05 ![Playground](/docs/release-notes/ReleaseNotes-2023-02-05-Playground.gif) * Tons of improvements to the prompt playground: * A new "compact" view, that shows just one line per row, so you can quickly scan across rows. You can toggle between the two modes. * Loading indicators per cell * The run button transforms into a "Stop" button while you are streaming data * Prompt variables are now syntax highlighted in purple and use a monospace font * Tab now autocompletes * We no longer auto-create variables as you're typing (was causing more trouble than helping) * Slider params like `max_tokens` are now optional * Cloudformation now supports more granular RDS configuration (instance type, storage, etc) * **Support optional slider params** * Made certain parameters like `max_tokens` optional. * Accompanies pull request [https://github.com/braintrustdata/braintrust-proxy/pull/23](https://github.com/braintrustdata/braintrust-proxy/pull/23). * Lots of style improvements for tables. * Fixed filter bar styles. 
* Rendered JSON cell values using monospace type. * Adjusted margins for horizontally scrollable tables. * Implemented a smaller size for avatars in tables. * Deleting a prompt takes you back to the prompts tab ## Week of 2024-01-29 * New [REST API](/docs/api/spec). * [Cookbook](/docs/cookbook) of common use cases and examples. * Support for [custom models](/docs/guides/playground#custom-models) in the playground. * Search now works across spans, not just top-level traces. * Show creator avatars in the prompt playground * Improved UI breadcrumbs and sticky table headers ## Week of 2024-01-22 * UI improvements to the playground. * Added an example of [closed QA / extra fields](/docs/guides/evals#additional-fields). * New YAML parser and new syntax highlighting colors for data editor. * Added support for enabling/disabling certain git fields from collection (in org settings and the SDK). * Added new GPT-3.5 and 4 models to the playground. * Fixed scrolling jitter issue in the playground. * Made table fields in the prompt playground sticky. ## Week of 2024-01-15 * Added ability to download dataset as CSV * Added YAML support for logging and visualizing traces * Added JSON mode in the playground * Added span icons and improved readability * Enabled shift modifier for selecting multiple rows in Tables * Improved tables to allow editing expected fields and moved datasets to trace view ## Week of 2024-01-08 * Released new [Docker deployment method for self hosting](https://www.braintrustdata.com/docs/self-hosting/docker) * Added ability to manually score results in the experiment UI * Added comments and audit log in the experiment UI ## Week of 2024-01-01 * Added ability to upload dataset CSV files in prompt playgrounds * Published new [guide for tracing and logging your code](https://www.braintrustdata.com/docs/guides/tracing) * Added support to download experiment results as CSVs ## Week of 2023-12-25 * API keys are now scoped to organizations, so if you are part of multiple orgs, new API keys will only permit access to the org they belong to. * You can now search for experiments by any metadata, including their name, author, or even git metadata. * Filters are now saved in URL state so you can share a link to a filtered view of your experiments or logs. * Improve performance of project page by optimizing API calls. We made several cleanups and improvements to the low-level typescript and python SDKs (0.0.86). If you use the Eval framework, nothing should change for you, but keep in mind the following differences if you use the manual logging functionality: * Simplified the low-level tracing API (updated docs coming soon!) * The current experiment and current logger are now maintained globally rather than as async-task-local variables. This makes it much simpler to start tracing with minimal code modification. Note that creating experiments/loggers with `withExperiment`/`withLogger` will now set the current experiment globally (visible across all async tasks) rather than local to a specific task. You may pass `setCurrent: false/set_current=False` to avoid setting the global current experiment/logger. * In python, the `@traced` decorator now logs the function input/output by default. This might interfere with code that already logs input/output inside the `traced` function. You may pass `notrace_io=True` as an argument to `@traced` to turn this logging off. * In typescript, the `traced` method can start spans under the global logger, and is thus async by default. 
You may pass `asyncFlush: true` to these functions to make the traced function synchronous. Note that if the function tries to trace under the global logger, it must also have `asyncFlush: true`.
* Removed the `withCurrent`/`with_current` functions.
* In TypeScript, the `Span.traced` method now accepts `name` as an optional argument instead of a required positional param. This matches the behavior of all other instances of `traced`. `name` is also now optional in Python, but this doesn't change the function signature.
* `Experiments` and `Datasets` are now lazily-initialized, similar to `Loggers`. This means all write operations are immediate and synchronous. But any metadata accessor methods (`[Experiment|Logger].[id|name|project]`) are now async.
* Undo auto-inference of `force_login` if `login` is invoked with different params than last time. Now `login` will only re-login if `forceLogin: true/force_login=True` is provided.

## Week of 2023-12-18

* Dropped the official 2023 Year-in-Review dashboard. Check out yours [here](/app/year-in-review)! ![2023 year in review](/blog/img/2023-summary.png)
* Improved ergonomics for the Python SDK:
  * The `@traced` decorator will automatically log inputs/outputs.
  * You no longer need to use context managers to scope experiments or loggers.
* Enable skew protection in frontend deploys, so hopefully no more hard refreshes.
* Added syntax highlighting in the sidepanel to improve readability.
* Add `jsonl` mode to the eval CLI to log experiment summaries in an easy-to-parse format.

## Week of 2023-12-11

* Released new [trials](https://www.braintrustdata.com/docs/guides/evals#trials) feature to rerun each input multiple times and collect aggregate results for a more robust score.
* Added ability to run evals in the prompt playground. Use your existing dataset and the autoevals functions to score playground outputs.
* Released new version of SDK (0.0.81) including a small breaking change. When setting the experiment name in the `Eval` function, the `experimentName` key should be moved from `metadata` to a top-level argument.

  before:

  ```
  Eval([eval_name], { ..., metadata: { experimentName: [experimentName] } })
  ```

  after:

  ```
  Eval([eval_name], { ..., experimentName: [experimentName] })
  ```
* Added support for Gemini and Mistral Platform in AI proxy and playground.

## Week of 2023-12-04

* Enabled the prompt playground and datasets for free users.
* Added Together.ai models including Mixtral to AI Proxy.
* Turned prompts tab on organization view into a list.
* Removed data row limit for the prompt playground.
* Enabled configuration for dark mode and light mode in settings.
* Added automatic logging of a diff if an experiment is run on a repo with uncommitted changes.

## Week of 2023-11-27

* Added experiment search on project view to filter by experiment name.
![Experiment search and filtering on project view](/docs/release-notes/ReleaseNotes11-27-search.gif)
* Upgraded AI Proxy to support [tracking Prometheus metrics](https://github.com/braintrustdata/braintrust-proxy/blob/a31a82e6d46ff442a3c478773e6eec21f3d0ba69/apis/cloudflare/wrangler-template.toml#L19C1-L19C1) * Modified Autoevals library to use the [AI proxy](/docs/guides/proxy) * Upgraded Python braintrust library to parallelize evals * Optimized the experiment diff view for better performance ## Week of 2023-11-20 * Added support for new Perplexity models (e.g. pplx-7b-online) to playground * Released [AI proxy](/docs/guides/proxy): access many LLMs using one API w/ caching * Added [load balancing endpoints](/docs/guides/proxy#load-balancing) to AI proxy * Updated org-level view to show projects and prompt playground sessions * Added ability to batch delete experiments * Added support for Claude 2.1 in playground ## Week of 2023-11-13 * Made resized experiment column widths persistent * Fixed our libraries, including Autoevals, to work with OpenAI’s new libraries
![Added OpenAI function calling in the prompt playground](/docs/release-notes/ReleaseNotes-2023-11-functions.gif)
* Added support for function calling and tools in our prompt playground * Added tabs on a project page for datasets, experiments, etc. ## Week of 2023-11-06 * Improved selectors for diffing and comparison modes on experiment view * Added support for new OpenAI models (GPT4 preview, 3.5turbo-1106) in playground * Added support for open-source models (Mistral, Codellama, Llama2, etc.) in playground using Perplexity's APIs ## Week of 2023-10-30 * Improved experiment sidebar to be fully responsive and resizable * Improved tooltips within the web UI * Multiple performance optimizations and bug fixes ## Week of 2023-10-23 * [Improved prompt playground variable handling and visualization](/docs/release-notes/ReleaseNotes-2023-10-PromptPlaygroundVar.mp4) * Added time duration statistics per row to experiment summaries ![ReleaseNotes-2023-10-dataset.png](/docs/release-notes/ReleaseNotes-2023-10-TimeDurationExperiments.png) * Multiple performance optimizations and bug fixes ## Week of 2023-10-16 * [Launched new tracing feature: log and visualize complex LLM chains and executions.](/docs/guides/evals#tracing) * Added a new “text-block” prompt type in the playground that just returns a string or variable back without an LLM call (useful for chaining prompts and debugging) * Increased default # of rows per page from 10 to 100 for experiments * UI fixes and improvements for the side panel and tooltips * The experiment dashboard can be customized to show the most relevant charts ## Week of 2023-10-09 * Performance improvements related to user sessions ## Week of 2023-10-02 * All experiment loading HTTP requests are 100-200ms faster * The prompt playground now supports autocomplete * Dataset versions are now displayed on the datasets page ![ReleaseNotes-2023-10-dataset.png](/docs/release-notes/ReleaseNotes-2023-10-dataset.png) * Projects in the summary page are now sorted alphabetically * Long text fields in logged data can be expanded into scrollable blocks * [We evaluated the Alpaca evals leaderboard in Braintrust](https://www.braintrustdata.com/app/braintrustdata.com/p/Alpaca-Evals) * [New tutorial for finetuning GPT3.5 and evaluating with Braintrust](https://colab.research.google.com/drive/10KIXBHjZ0VUc-zN79_cuVeKy9ZiUQy4M?usp=sharing) ## Week of 2023-09-18 * The Eval framework is now supported in Python! See the updated [evals guide](/docs/guides/evals) for more information: ```python from braintrust import Eval from autoevals import LevenshteinScorer Eval( "Say Hi Bot", data=lambda: [ { "input": "Foo", "expected": "Hi Foo", }, { "input": "Bar", "expected": "Hello Bar", }, ], # Replace with your eval dataset task=lambda input: "Hi " + input, # Replace with your LLM call scores=[LevenshteinScorer], ) ``` * Onboarding and signup flow for new users * Switch product font to Inter ## Week of 2023-09-11 * Big performance improvements for registering experiments (down from \~5s to \<1s). Update the SDK to take advantage of these improvements. * New graph shows aggregate accuracy between experiments for each score. ![Score Comparison Chart](/docs/release-notes/ReleaseNotes-2023-09-Comparison.png) * Throw errors in the prompt playground if you reference an invalid variable. * A major backend database change that significantly improves performance while reducing costs. Please contact us if you have not already heard from us about upgrading your deployment. * No more record size constraints (previously, strings could be at most 64KB long).
* New autoevals for numeric diff and JSON diff ## Week of 2023-09-05 * You can duplicate prompt sessions, prompts, and dataset rows in the prompt playground. * You can download prompt sessions as JSON files (including the prompt templates, prompts, and completions). * You can adjust model parameters (e.g. temperature) in the prompt playground. * You can publicly share experiments (e.g. [Alpaca Evals](https://www.braintrustdata.com/app/braintrustdata.com/p/Alpaca-Evals/GPT4-w-metadata-claudegraded?c=llama2-70b-w-metadata-claudegraded)). * Datasets now support editing, deleting, adding, and copying rows in the UI. * There is no longer a 64KB limit on strings. ## Week of 2023-08-28 * The prompt playground is now live! We're excited to get your feedback as we continue to build this feature out. See [the docs](/docs/guides/playground) for more information. ![Sync Playground](/docs/release-notes/ReleaseNotes-2023-08-Playground.gif) ## Week of 2023-08-21 * A new chart shows experiment progress per score over time. ![Experiment Progress](/docs/release-notes/ReleaseNotes-2023-08-ExperimentProgress.png) * The [eval CLI](/docs/guides/evals) now supports `--watch`, which will automatically re-run your evaluation when you make changes to your code. * You can now edit datasets in the UI. ![Edit Dataset](/docs/release-notes/ReleaseNotes-2023-08-EditDataset.gif) ## Week of 2023-08-14 * Introducing datasets! You can now upload datasets to Braintrust and use them in your experiments. Datasets are versioned, and you can use them in multiple experiments. You can also use datasets to compare your model's performance against a baseline. Learn more about [how to create and use datasets in the docs](/docs/guides/datasets). * Fix several performance issues in the SDK and UI. ## Week of 2023-08-07 * Complex data is now substantially more performant in the UI. Prior to this change, we ran schema inference over the entire `input`, `output`, `expected`, and `metadata` fields, which could result in complex structures that were slow and difficult to work with. Now, we simply treat these fields as `JSON` types. * The UI updates in real-time as new records are logged to experiments. * Ergonomic improvements to the SDK and CLI: * The JS library is now Isomorphic and supports both Node.js and the browser. * The Evals CLI warns you when no files match the `.eval.[ts|js]` pattern. ## Week of 2023-07-31 * You can now break down scores by metadata fields: ![Grouped Score Chart](/docs/release-notes/ReleaseNotes-2023-07-Group-Chart.png) * Improve performance for experiment loading (especially complex experiments). Prior to this change, you may have seen experiments take 30s+ occasionally or even fail. To enable this, you'll need to update your CloudFormation. 
* Support for renaming and deleting experiments: ![Rename Delete Menu](/docs/release-notes/ReleaseNotes-2023-07-Rename-Delete.png) * When you expand a cell in detail view, the row is now highlighted: ![Highlight Row](/docs/release-notes/ReleaseNotes-2023-08-TableSelected.png) ## Week of 2023-07-24 * A new [framework](/docs/guides/evals) for expressing evaluations in a much simpler way: ```js #skip-compile import { Eval } from "braintrust"; import { Factuality } from "autoevals"; Eval("My Evaluation", { data: () => [ { input: "Which country has the highest population?", expected: "China", meta: { type: "question" }, }, ], task: (input) => callModel(input), scores: [Factuality], }); ``` Besides being much easier than the logging SDK, this framework sets the foundation for evaluations that can be run automatically as your code changes, built and run in the cloud, and more. We are very excited about the use cases it will open up! * `inputs` is now `input` in the SDK (>= 0.0.23) and UI. You do not need to make any code changes, although you should gradually start using the `input` field instead of `inputs` in your SDK calls, as `inputs` is now deprecated and will eventually be removed. * Improved diffing behavior for nested arrays. ## Week of 2023-07-17 * A couple of SDK updates (>= v0.0.21) that allow you to update an existing experiment `init(..., update=True)` and specify an id in `log(..., id='my-custom-id')`. These tools are useful for running an experiment across multiple processes, tasks, or machines, and idempotently logging the same record (identified by its `id`). * Note: If you have Braintrust installed in your own cloud environment, make sure to update the CloudFormation (available at [https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml](https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml)). * Tables with lots and lots of columns are now visually more compact in the UI: *Before:* ![Table before](/docs/release-notes/ReleaseNotes-2023-07-Table-Before.png) *After:* ![Table after](/docs/release-notes/ReleaseNotes-2023-07-Table-After.png) ## Week of 2023-07-10 * A new [Node.js SDK](/docs/libs/nodejs) ([npm](https://www.npmjs.com/package/braintrust)) which mirrors the [Python SDK](/docs/reference/libs/python). As this SDK is new, please let us know if you run into any issues or have any feedback. If you have Braintrust installed in your own cloud environment, make sure to update the CloudFormation (available at [https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml](https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml)) to include some functionality the Node.js SDK relies on. You can do this in the AWS console, or by running the following command (with the `braintrust` command included in the Python SDK). ```bash braintrust install api --update-template ``` * You can now swap the primary and comparison experiment with a single click. ![Swap experiments](/docs/release-notes/ReleaseNotes-2023-07-Swap.gif) * You can now compare `output` vs. `expected` within an experiment. ![Diff output and expected](/docs/release-notes/ReleaseNotes-2023-07-Diff.gif) * Version 0.0.19 is out for the SDK. It is an important update that throws an error if your payload is larger than 64KB in size. ## Week of 2023-07-03 * Support for real-time updates, using Redis. Prior to this, Braintrust would wait for your data warehouse to sync up with Kafka before you could view an experiment, often leading to a minute or two of time before a page loads. 
Now, we cache experiment records as your experiment is running, making experiments load instantly. To enable this, you'll need to update your CloudFormation. * New settings page that consolidates team, installation, and API key settings. You can now invite team members to your Braintrust account from the "Team" page. ![Settings Page](/docs/release-notes/ReleaseNotes-2023-07-Settings.png) * The experiment page now shows commit information for experiments run inside of a git repository. ![Git info](/docs/release-notes/ReleaseNotes-2023-07-git-info.png) ## Week of 2023-06-26 * Experiments track their git metadata and automatically find a "base" experiment to compare against, using your repository's base branch. * The Python SDK's [`summarize()`](/docs/libs/python#summarize) method now returns an [`ExperimentSummary`](/docs/libs/python#experimentsummary-objects) object with score differences against the base experiment (v0.0.10). * Organizations can now be "multi-tenant", i.e. you do not need to install in your cloud account. If you start with a multi-tenant account to try out Braintrust, and decide to move it into your own account, Braintrust can migrate it for you. ## Week of 2023-06-19 * New scatter plot and histogram insights to quickly analyze scores and filter down examples. ![Scatter Plot](/docs/release-notes/ReleaseNotes-2023-06-Scatter.gif) * API keys that can be set in the SDK (explicitly or through an environment variable) and do not require user login. Visit the settings page to create an API key. * Update the braintrust Python SDK to [version 0.0.6](https://pypi.org/project/braintrust/0.0.6/) and the CloudFormation template ([https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml](https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml)) to use the new API key feature. ## Week of 2023-06-12 * New `braintrust install` CLI for installing the CloudFormation * Improved performance for event logging in the SDK * Auto-merge experiment fields with different types (e.g. `number` and `string`) ## Week of 2023-06-05 * [Tutorial guide + notebook](/docs/start) * Automatically refresh cognito tokens in the Python client * New filter and sort operators on the experiments table: * Filter experiments by changes to scores (e.g. only examples with a lower score than another experiment) * Custom SQL filters * Filter and sort bubbles to visualize/clear current operations * \[Alpha] SQL query explorer to run arbitrary queries against one or more experiments
SQL Explorer --- file: ./content/docs/cookbook/index.mdx meta: { "title": "Cookbook" } # Cookbook This cookbook, inspired by [OpenAI's cookbook](https://cookbook.openai.com/), is a collection of recipes for common use cases of [Braintrust](/). Each recipe is an open-source, self-contained example, hosted on [GitHub](https://github.com/braintrustdata/braintrust-cookbook). We welcome community contributions and aspire for the cookbook to be a collaborative, living, breathing collection of best practices for building high-quality AI products. {recipes .sort((a, b) => new Date(b.date) - new Date(a.date)) .map((recipe, idx) => { const slug = encodeURIComponent(recipe.urlPath); return ( ); })} --- file: ./content/docs/guides/access-control.mdx meta: { "title": "Access control" } # Access control Braintrust has a robust and flexible access control system. It's possible to grant user permissions both at the organization level and scoped to individual objects within Braintrust (projects, experiments, logs, datasets, prompts, and playgrounds). ## Permission groups The core concept of Braintrust's access control system is the permission group. Permission groups are collections of users that can be granted specific permissions. Braintrust has three pre-configured permission groups that are scoped to the organization. 1. **Owners** - Unrestricted access to the organization, its data, and its settings. Can add, modify, and delete projects and all other resources. Can invite and remove members and can manage group membership. 2. **Engineers** - Can access, create, update, and delete projects and all resources within projects. Cannot invite or remove members or manage access to resources. 3. **Viewers** - Can access projects and all resources within projects. Cannot create, update, or delete any resources. Cannot invite or remove members or manage access to resources. If your access control needs are simple and you do not need to restrict access to individual projects, these ready-made permission groups may be all that you need. A new user can be added to one of these three groups when you invite them to your organization. ![Built-in Permission Groups](./access-control/built-in-permission-groups.png) ## Creating custom permission groups In addition to the built-in permission groups, it's possible to create your own groups as well. To do so, go to the 'Permission groups' page of Settings and click on the 'Create permission group' button. Give your group a name and a description and then click 'Create'. ![Create group](./access-control/create-group.png) To set organization-level permissions for your new group, find the group in the groups list and click on the Permissions button. ![Custom group permissions](./access-control/custom-group-permissions.png) The 'Manage Access' permission should be granted judiciously as it is a super-user permission. It gives the user the ability to add and remove permissions, so any user with 'Manage Access' can grant all other permissions to themselves. \ \ The 'Manage Settings' permission grants users the ability to change organization-level settings like the API URL. ## Project scoped permissions To limit access to a specific project, create a new permission group from the Settings page. ![Project level permissions](./access-control/create-project-level.png) Navigate to the Configuration page of that project, and click on the Permissions link in the context menu.
![Project level permissions](./access-control/project-level-permissions.png) Search for your group by typing in the text input at the top of the page, and then click the pencil icon next to the group to set permissions. ![Search for group](./access-control/search-for-group.png) Set the project-level permissions for your group and click Save. ![Set project level permissions](./access-control/set-project-level-permissions.png) ## Object scoped permissions To limit access to a particular object (experiment, dataset, log, prompt, or playground) within a project, first create a permission group for those users on the 'Permission groups' section of Settings. ![Create experiment level group](./access-control/create-experiment-level-group.png) Next, navigate to the Configuration page of the project that holds that object and grant the group 'Read' permission at the project level. This will allow users in that group to navigate to the project in the Braintrust UI. ![Experiment level project permissions](./access-control/experiment-level-project-permissions.png) ![Setting project permissions for experiment](./access-control/read-on-project-for-your-experiment.png) Finally, navigate to your object and select Permissions from the context menu in the top-right of that object's page. ![Experiment level project permissions](./access-control/experiment-level-permissions-link.png) Find the permission group via the search input, and click the pencil icon to set permissions for the group. ![Experiment level find group](./access-control/experiment-level-find-group.png) Set the desired permissions for the group scoped to this specific object. ![Experiment level find group](./access-control/experiment-level-set-permissions.png) ## API support To automate the creation of permission groups and their access control rules, you can use the Braintrust API. For more information on using the API to manage permission groups, check out the [API reference for groups](/docs/reference/api/Groups#list-groups) and for [permissions](/docs/reference/api#list-acls). --- file: ./content/docs/guides/api.mdx meta: { "title": "API walkthrough" } # API walkthrough The Braintrust REST API is available via an OpenAPI spec published at [https://github.com/braintrustdata/braintrust-openapi](https://github.com/braintrustdata/braintrust-openapi). This guide walks through a few common use cases, and should help you get started with using the API. Each example is implemented in a particular language, for legibility, but the API itself is language-agnostic. To learn more about the API, see the full [API spec](/docs/api/spec). If you are looking for a language-specific wrapper over the bare REST API, we support several different [languages](/docs/reference/api#api-wrappers). ## Running an experiment ```python #skip-test #foo import os from uuid import uuid4 import requests API_URL = "https://api.braintrust.dev/v1" headers = {"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]} if __name__ == "__main__": # Create a project, if it does not already exist project = requests.post(f"{API_URL}/project", headers=headers, json={"name": "rest_test"}).json() print(project) # Create an experiment. 
This should always be new experiment = requests.post( f"{API_URL}/experiment", headers=headers, json={"name": "rest_test", "project_id": project["id"]} ).json() print(experiment) # Log some stuff for i in range(10): resp = requests.post( f"{API_URL}/experiment/{experiment['id']}/insert", headers=headers, json={"events": [{"id": uuid4().hex, "input": 1, "output": 2, "scores": {"accuracy": 0.5}}]}, ) if not resp.ok: raise Exception(f"Error: {resp.status_code} {resp.text}: {resp.content}") ``` ## Fetching experiment results Let's say you have a [human review](/docs/guides/human-review) workflow and you want to determine if an experiment has been fully reviewed. You can do this by running a [Braintrust query language (BTQL)](/docs/reference/btql) query: ```sql from: experiment('') measures: sum("My review score" IS NOT NULL) AS reviewed, count(1) AS total filter: is_root -- Only count traces, not spans ``` To do this in Python, you can use the `btql` endpoint: ```python import os import requests API_URL = "https://api.braintrust.dev/" headers = {"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]} def make_query(experiment_id: str) -> str: # Replace "response quality" with the name of your review score column return f""" from: experiment('{experiment_id}') measures: sum(scores."response quality" IS NOT NULL) AS reviewed, sum(is_root) AS total """ def fetch_experiment_review_status(experiment_id: str) -> dict: return requests.post( f"{API_URL}/btql", headers=headers, json={"query": make_query(experiment_id), "fmt": "json"}, ).json() EXPERIMENT_ID = "bdec1c5e-8c00-4033-84f0-4e3aa522ecaf" # Replace with your experiment ID print(fetch_experiment_review_status(EXPERIMENT_ID)) ``` ## Paginating a large dataset ```typescript // If you're self-hosting Braintrust, then use your stack's Universal API URL, e.g. // https://dfwhllz61x709.cloudfront.net export const BRAINTRUST_API_URL = "https://api.braintrust.dev"; export const API_KEY = process.env.BRAINTRUST_API_KEY; export async function* paginateDataset(args: { project: string; dataset: string; version?: string; // Number of rows to fetch per request. You can adjust this to be a lower number // if your rows are very large (e.g. several MB each). perRequestLimit?: number; }) { const { project, dataset, version, perRequestLimit } = args; const headers = { Accept: "application/json", "Accept-Encoding": "gzip", Authorization: `Bearer ${API_KEY}`, }; const fullURL = `${BRAINTRUST_API_URL}/v1/dataset?project_name=${encodeURIComponent( project, )}&dataset_name=${encodeURIComponent(dataset)}`; const ds = await fetch(fullURL, { method: "GET", headers, }); if (!ds.ok) { throw new Error( `Error fetching dataset metadata: ${ds.status}: ${await ds.text()}`, ); } const dsJSON = await ds.json(); const dsMetadata = dsJSON.objects[0]; if (!dsMetadata?.id) { throw new Error(`Dataset not found: ${project}/${dataset}`); } let cursor: string | null = null; while (true) { const body: string = JSON.stringify({ query: { from: { op: "function", name: { op: "ident", name: ["dataset"] }, args: [{ op: "literal", value: dsMetadata.id }], }, select: [{ op: "star" }], limit: perRequestLimit, cursor, }, fmt: "jsonl", version, }); const response = await fetch(`${BRAINTRUST_API_URL}/btql`, { method: "POST", headers, body, }); if (!response.ok) { throw new Error( `Error fetching rows for ${dataset}: ${ response.status }: ${await response.text()}`, ); } cursor = response.headers.get("x-bt-cursor") ?? 
response.headers.get("x-amz-meta-bt-cursor"); // Parse jsonl line-by-line const allRows = await response.text(); const rows = allRows.split("\n"); let rowCount = 0; for (const row of rows) { if (!row.trim()) { continue; } yield JSON.parse(row); rowCount++; } if (rowCount === 0) { break; } } } async function main() { for await (const row of paginateDataset({ project: "Your project name", // Replace with your project name dataset: "Your dataset name", // Replace with your dataset name perRequestLimit: 100, })) { console.log(row); } } main(); ``` ## Impersonating a user for a request User impersonation allows a privileged user to perform an operation on behalf of another user, using the impersonated user's identity and permissions. For example, a proxy service may wish to forward requests coming in from individual users to Braintrust without requiring each user to directly specify Braintrust credentials. The privileged service can initiate the request with its own credentials and impersonate the user so that Braintrust runs the operation with the user's permissions. To this end, all API requests accept a header `x-bt-impersonate-user`, which you can set to the ID or email of the user to impersonate. Currently impersonating another user requires that the requesting user has specifically been granted the `owner` role over all organizations that the impersonated user belongs to. This check guarantees the requesting user has at least the set of permissions that the impersonated user has. Consider the following code example for configuring ACLs and running a request with user impersonation. ```javascript // If you're self-hosting Braintrust, then use your stack's Universal API URL, e.g. // https://dfwhllz61x709.cloudfront.net export const BRAINTRUST_API_URL = "https://api.braintrust.dev"; export const API_KEY = process.env.BRAINTRUST_API_KEY; async function getOwnerRoleId() { const roleResp = await fetch( `${BRAINTRUST_API_URL}/v1/role?${new URLSearchParams({ role_name: "owner" })}`, { method: "GET", headers: { Authorization: `Bearer ${API_KEY}`, }, }, ); if (!roleResp.ok) { throw new Error(await roleResp.text()); } const roles = await roleResp.json(); return roles.objects[0].id; } async function getUserOrgInfo(orgName: string): Promise<{ user_id: string; org_id: string; }> { const meResp = await fetch(`${BRAINTRUST_API_URL}/api/self/me`, { method: "POST", headers: { Authorization: `Bearer ${API_KEY}`, }, }); if (!meResp.ok) { throw new Error(await meResp.text()); } const meInfo = await meResp.json(); const orgInfo = meInfo.organizations.find( (x: { name: string }) => x.name === orgName, ); if (!orgInfo) { throw new Error(`No organization found with name ${orgName}`); } return { user_id: meInfo.id, org_id: orgInfo.id }; } async function grantOwnershipRole(orgName: string) { const ownerRoleId = await getOwnerRoleId(); const { user_id, org_id } = await getUserOrgInfo(orgName); // Grant an 'owner' ACL to the requesting user on the organization. Granting // this ACL requires the user to have `create_acls` permission on the org, which // means they must already be an owner of the org indirectly. 
const aclResp = await fetch(`${BRAINTRUST_API_URL}/v1/acl`, { method: "POST", headers: { Authorization: `Bearer ${API_KEY}`, "Content-Type": "application/json", }, body: JSON.stringify({ object_type: "organization", object_id: org_id, user_id, role_id: ownerRoleId, }), }); if (!aclResp.ok) { throw new Error(await aclResp.text()); } } async function main() { if (!process.env.ORG_NAME || !process.env.USER_EMAIL) { throw new Error("Must specify ORG_NAME and USER_EMAIL"); } // This only needs to be done once. await grantOwnershipRole(process.env.ORG_NAME); // This will only succeed if the user being impersonated has permissions to // create a project within the org. const projectResp = await fetch(`${BRAINTRUST_API_URL}/v1/project`, { method: "POST", headers: { Authorization: `Bearer ${API_KEY}`, "Content-Type": "application/json", "x-bt-impersonate-user": process.env.USER_EMAIL, }, body: JSON.stringify({ name: "my-project", org_name: process.env.ORG_NAME, }), }); if (!projectResp.ok) { throw new Error(await projectResp.text()); } console.log(await projectResp.json()); } main(); ``` ```python import os import requests # If you're self-hosting Braintrust, then use your stack's Universal API URL, e.g. # https://dfwhllz61x709.cloudfront.net BRAINTRUST_API_URL = "https://api.braintrust.dev" API_KEY = os.environ["BRAINTRUST_API_KEY"] def get_owner_role_id(): resp = requests.get( f"{BRAINTRUST_API_URL}/v1/role", headers={"Authorization": f"Bearer {API_KEY}"}, params=dict(role_name="owner"), ) resp.raise_for_status() return resp.json()["objects"][0]["id"] def get_user_org_info(org_name): resp = requests.post( f"{BRAINTRUST_API_URL}/self/me", headers={"Authorization": f"Bearer {API_KEY}"}, ) resp.raise_for_status() me_info = resp.json() org_info = [x for x in me_info["organizations"] if x["name"] == org_name] if not org_info: raise Exception(f"No organization found with name {org_name}") return dict(user_id=me_info["id"], org_id=org_info[0]["id"]) def grant_ownership_role(org_name): owner_role_id = get_owner_role_id() user_org_info = get_user_org_info(org_name) # Grant an 'owner' ACL to the requesting user on the organization. Granting # this ACL requires the user to have `create_acls` permission on the org, # which means they must already be an owner of the org indirectly. resp = requests.post( f"{BRAINTRUST_API_URL}/v1/acl", headers={"Authorization": f"Bearer {API_KEY}"}, json=dict( object_type="organization", object_id=user_org_info["org_id"], user_id=user_org_info["user_id"], role_id=owner_role_id, ), ) resp.raise_for_status() def main(): # This only needs to be done once. grant_ownership_role(os.environ["ORG_NAME"]) # This will only succeed if the user being impersonated has permissions to # create a project within the org. resp = requests.post( f"{BRAINTRUST_API_URL}/v1/project", headers={ "Authorization": f"Bearer {API_KEY}", "x-bt-impersonate-user": os.environ["USER_EMAIL"], }, json=dict( name="my-project", org_name=os.environ["ORG_NAME"], ), ) resp.raise_for_status() print(resp.json()) if __name__ == "__main__": main() ``` ## Postman [Postman](https://www.postman.com/) is a popular tool for interacting with HTTP APIs. You can load Braintrust's API spec into Postman by simply importing the OpenAPI spec's URL: ``` https://raw.githubusercontent.com/braintrustdata/braintrust-openapi/main/openapi/spec.json ``` ![Postman](./api/postman.gif) ## Tracing with the REST API SDKs In this section, we demonstrate the basics of logging with tracing using the language-specific REST API SDKs.
The end result of running each example should be a single log entry in a project called `tracing_test`, which looks like the following: ![Tracing Test Screenshot](/docs/tracing-test-example.png) ```go package main import ( "context" "github.com/braintrustdata/braintrust-go" "github.com/braintrustdata/braintrust-go/shared" "github.com/google/uuid" "time" ) type LLMInteraction struct { input interface{} output interface{} } func runInteraction0(input interface{}) LLMInteraction { return LLMInteraction{ input: input, output: "output0", } } func runInteraction1(input interface{}) LLMInteraction { return LLMInteraction{ input: input, output: "output1", } } func getCurrentTime() float64 { return float64(time.Now().UnixMilli()) / 1000. } func main() { client := braintrust.NewClient() // Create a project, if it does not already exist project, err := client.Projects.New(context.TODO(), braintrust.ProjectNewParams{ Name: braintrust.F("tracing_test"), }) if err != nil { panic(err) } rootSpanId := uuid.NewString() client.Projects.Logs.Insert( context.TODO(), project.ID, braintrust.ProjectLogInsertParams{ Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{ shared.InsertProjectLogsEventReplaceParam{ ID: braintrust.F(rootSpanId), Metadata: braintrust.F(map[string]interface{}{ "user_id": "user123", }), SpanAttributes: braintrust.F(braintrust.InsertProjectLogsEventReplaceSpanAttributesParam{ Name: braintrust.F("User Interaction"), }), Metrics: braintrust.F(braintrust.InsertProjectLogsEventReplaceMetricsParam{ Start: braintrust.F(getCurrentTime()), }), }, }), }, ) interaction0Id := uuid.NewString() client.Projects.Logs.Insert( context.TODO(), project.ID, braintrust.ProjectLogInsertParams{ Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{ shared.InsertProjectLogsEventReplaceParam{ ID: braintrust.F(interaction0Id), ParentID: braintrust.F(rootSpanId), SpanAttributes: braintrust.F(braintrust.InsertProjectLogsEventReplaceSpanAttributesParam{ Name: braintrust.F("Interaction 0"), }), Metrics: braintrust.F(braintrust.InsertProjectLogsEventReplaceMetricsParam{ Start: braintrust.F(getCurrentTime()), }), }, }), }, ) interaction0 := runInteraction0("hello world") client.Projects.Logs.Insert( context.TODO(), project.ID, braintrust.ProjectLogInsertParams{ Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{ braintrust.InsertProjectLogsEventMergeParam{ ID: braintrust.F(interaction0Id), IsMerge: braintrust.F(true), Input: braintrust.F(interaction0.input), Output: braintrust.F(interaction0.output), Metrics: braintrust.F(braintrust.InsertProjectLogsEventMergeMetricsParam{ End: braintrust.F(getCurrentTime()), }), }, }), }, ) interaction1Id := uuid.NewString() client.Projects.Logs.Insert( context.TODO(), project.ID, braintrust.ProjectLogInsertParams{ Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{ braintrust.InsertProjectLogsEventReplaceParam{ ID: braintrust.F(interaction1Id), ParentID: braintrust.F(rootSpanId), SpanAttributes: braintrust.F(braintrust.InsertProjectLogsEventReplaceSpanAttributesParam{ Name: braintrust.F("Interaction 1"), }), Metrics: braintrust.F(braintrust.InsertProjectLogsEventReplaceMetricsParam{ Start: braintrust.F(getCurrentTime()), }), }, }), }, ) interaction1 := runInteraction1(interaction0.output) client.Projects.Logs.Insert( context.TODO(), project.ID, braintrust.ProjectLogInsertParams{ Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{ braintrust.InsertProjectLogsEventMergeParam{ ID: braintrust.F(interaction1Id), 
IsMerge: braintrust.F(true), Input: braintrust.F(interaction1.input), Output: braintrust.F(interaction1.output), Metrics: braintrust.F(braintrust.InsertProjectLogsEventMergeMetricsParam{ End: braintrust.F(getCurrentTime()), }), }, }), }, ) client.Projects.Logs.Insert( context.TODO(), project.ID, braintrust.ProjectLogInsertParams{ Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{ braintrust.InsertProjectLogsEventMergeParam{ ID: braintrust.F(rootSpanId), IsMerge: braintrust.F(true), Input: braintrust.F(interaction0.input), Output: braintrust.F(interaction1.output), Metrics: braintrust.F(braintrust.InsertProjectLogsEventMergeMetricsParam{ End: braintrust.F(getCurrentTime()), }), }, }), }, ) } ``` --- file: ./content/docs/guides/datasets.mdx meta: { "title": "Datasets" } # Datasets Datasets allow you to collect data from production, staging, evaluations, and even manually, and then use that data to run evaluations and track improvements over time. For example, you can use datasets to: * Store evaluation test cases for your eval script instead of managing large JSONL or CSV files * Log all production generations to assess quality manually or using model-graded evals * Store user-reviewed (thumbs up/down) generations to find new test cases In Braintrust, datasets have a few key properties: * **Integrated**. Datasets are integrated with the rest of the Braintrust platform, so you can use them in evaluations, explore them in the playground, and log to them from your staging/production environments. * **Versioned**. Every insert, update, and delete is versioned, so you can pin evaluations to a specific version of the dataset, rewind to a previous version, and track changes over time. * **Scalable**. Datasets are stored in a modern cloud data warehouse, so you can collect as much data as you want without worrying about storage or performance limits. * **Secure**. If you run Braintrust [in your cloud environment](/docs/guides/self-hosting), datasets are stored in your warehouse and never touch our infrastructure. ## Creating a dataset Records in a dataset are stored as JSON objects, and each record has three top-level fields: * `input` is a set of inputs that you could use to recreate the example in your application. For example, if you're logging examples from a question-answering model, the input might be the question. * `expected` (optional) is the output of your model. For example, if you're logging examples from a question-answering model, this might be the answer. You can access `expected` when running evaluations as the `expected` field; however, `expected` does not need to be the ground truth. * `metadata` (optional) is a set of key-value pairs that you can use to filter and group your data. For example, if you're logging examples from a question-answering model, the metadata might include the knowledge source that the question came from. Datasets are created automatically when you initialize them in the SDK.
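For instance, a record for the question-answering example above might look roughly like this (an illustrative sketch; the field names follow the structure described above, and the concrete values are hypothetical):

```python
# An illustrative dataset record for a question-answering application.
# The top-level fields (input/expected/metadata) are the ones described above;
# the values are made up for the example.
record = {
    "input": "What is our refund policy for annual plans?",  # enough to recreate the example
    "expected": "Annual plans can be refunded within 30 days of purchase.",  # reference answer, not necessarily ground truth
    "metadata": {"knowledge_source": "billing-faq"},  # useful for filtering and grouping
}
```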
### Inserting records You can use the SDK to initialize and insert into a dataset: ```javascript import { initDataset } from "braintrust"; async function main() { const dataset = initDataset("My App", { dataset: "My Dataset" }); for (let i = 0; i < 10; i++) { const id = dataset.insert({ input: i, expected: { result: i + 1, error: null }, metadata: { foo: i % 2 }, }); console.log("Inserted record with id", id); } console.log(await dataset.summarize()); } main(); ``` ```python import braintrust dataset = braintrust.init_dataset(project="My App", name="My Dataset") for i in range(10): id = dataset.insert(input=i, expected={"result": i + 1, "error": None}, metadata={"foo": i % 2}) print("Inserted record with id", id) print(dataset.summarize()) ``` ### Updating records In the above example, each `insert()` statement returns an `id`. You can use this `id` to update the record using `update()`: ```javascript #skip-compile dataset.update({ id, input: i, expected: { result: i + 1, error: "Timeout" }, }); ``` ```python dataset.update(input=i, expected={"result": i + 1, "error": "Timeout"}, id=id) ``` The `update()` method applies a merge strategy: only the fields you provide will be updated, and all other existing fields in the record will remain unchanged. ### Deleting records You can delete records via code by `id`: ```javascript #skip-compile await dataset.delete(id); ``` ```python dataset.delete(id) ``` To delete an entire dataset, use the [API command](/docs/reference/api/Datasets#delete-dataset). ### Flushing In both TypeScript and Python, the Braintrust SDK flushes records as fast as possible and installs an exit handler that tries to flush records, but these hooks are not always respected (e.g. by certain runtimes, or if you `exit` a process yourself). If you need to ensure that records are flushed, you can call `flush()` on the dataset. ```javascript #skip-compile await dataset.flush(); ``` ```python dataset.flush() ``` ### Multimodal datasets You may want to store or process images in your datasets. There are currently three ways to use images in Braintrust: * Image URLs (most performant) * Base64 (least performant) * Attachments (easiest to manage, stored in Braintrust) If you're building a dataset of large images in Braintrust, we recommend using image URLs. This keeps your dataset lightweight and allows you to preview or process them without storing heavy binary data directly. If you prefer to keep all data within Braintrust, create a dataset of attachments instead. In addition to images, you can create datasets of attachments that have any arbitrary data type, including audio and PDFs. You can then [use these datasets in evaluations](/docs/guides/evals/write#attachments). ```typescript title="attachment_dataset.ts" import { Attachment, initDataset } from "braintrust"; import path from "node:path"; async function createPdfDataset(): Promise<void> { const dataset = initDataset({ project: "Project with PDFs", dataset: "My PDF Dataset", }); for (const filename of ["example.pdf"]) { dataset.insert({ input: { file: new Attachment({ filename, contentType: "application/pdf", data: path.join("files", filename), }), }, }); } await dataset.flush(); } // Create a dataset with attachments.
createPdfDataset(); ``` To invoke this script, run this in your terminal: ```bash npx tsx attachment_dataset.ts ``` ```python title="attachment_dataset.py" import os from typing import Any, Dict from braintrust import Attachment, init_dataset def create_pdf_dataset() -> None: """Create a dataset with attachments.""" dataset = init_dataset("Project with PDFs", "My PDF Dataset") for filename in ["example.pdf"]: dataset.insert( input={ "file": Attachment( filename=filename, content_type="application/pdf", # The file on your filesystem or the file's bytes. data=os.path.join("files", filename), ) }, # This is a toy example where we check that the file size is what we expect. expected=469513, ) dataset.flush() # Create a dataset with attachments. create_pdf_dataset() ``` To invoke this script, run this in your terminal: ```bash python attachment_dataset.py ``` Attachments are not yet supported in the playground. To explore images in the playground, we recommend using image URLs. ## Managing datasets in the UI In addition to managing datasets through the API, you can also manage them in the Braintrust UI. ### Viewing a dataset You can view a dataset in the Braintrust UI by navigating to the project and then clicking on the dataset. ![Dataset Viewer](/docs/guides/datasets/datasets.webp) From the UI, you can filter records, create new ones, edit values, and delete records. You can also copy records between datasets and from experiments into datasets. This feature is commonly used to collect interesting or anomalous examples into a golden dataset. #### Create custom columns When viewing a dataset, create [custom columns](/docs/guides/evals/interpret#create-custom-columns) to extract specific values from `input`, `expected`, or `metadata` fields. ### Creating a dataset The easiest way to create a dataset is to upload a CSV file. ![Upload CSV](./datasets/CSV-Upload.gif) ### Updating records Once you've uploaded a dataset, you can update records or add new ones directly in the UI. ![Edit record](./datasets/Edit-record.gif) ### Labeling records In addition to updating datasets through the API, you can edit and label them in the UI. Like experiments and logs, you can configure [categorical fields](/docs/guides/human-review#writing-to-expected-fields) to allow human reviewers to rapidly label records. This requires you to first [configure human review](/docs/guides/human-review#configuring-human-review) in the **Configuration** tab of your project. ![Write to expected](./human-review/expected-fields.png) ### Deleting records To delete a record, navigate to **Library → Datasets** and select the dataset. Select the check box next to the individual record you'd like to delete, and then select the **Trash** icon. You can follow the same steps to delete an entire dataset from the **Library > Datasets** page. ## Using a dataset in an evaluation You can use a dataset in an evaluation by passing it directly to the `Eval()` function. 
```typescript import { initDataset, Eval } from "braintrust"; import { Levenshtein } from "autoevals"; Eval( "Say Hi Bot", // Replace with your project name { data: initDataset("My App", { dataset: "My Dataset" }), task: async (input) => { return "Hi " + input; // Replace with your LLM call }, scores: [Levenshtein], }, ); ``` ```python from braintrust import Eval, init_dataset from autoevals import Levenshtein Eval( "Say Hi Bot", # Replace with your project name data=init_dataset(project="My App", name="My Dataset"), task=lambda input: "Hi " + input, # Replace with your LLM call scores=[Levenshtein], ) ``` You can also manually iterate through a dataset's records and run your tasks, then log the results to an experiment. Log the `id`s to link each dataset record to the corresponding result. ```typescript import { initDataset, init, Dataset, Experiment } from "braintrust"; function myApp(input: any) { return `output of input ${input}`; } function myScore(output: any, rowExpected: any) { return Math.random(); } async function main() { const dataset = initDataset("My App", { dataset: "My Dataset" }); const experiment = init("My App", { experiment: "My Experiment", dataset: dataset, }); for await (const row of dataset) { const output = myApp(row.input); const closeness = myScore(output, row.expected); experiment.log({ input: row.input, output, expected: row.expected, scores: { closeness }, datasetRecordId: row.id, }); } console.log(await experiment.summarize()); } main(); ``` ```python import random import braintrust def my_app(input): return f"output of input {input}" def my_score(output, row_expected): return random.random() dataset = braintrust.init_dataset(project="My App", name="My Dataset") experiment = braintrust.init(project="My App", experiment="My Experiment", dataset=dataset) for row in dataset: output = my_app(row["input"]) closeness = my_score(output, row["expected"]) experiment.log( input=row["input"], output=output, expected=row["expected"], scores=dict(closeness=closeness), dataset_record_id=row["id"], ) print(experiment.summarize()) ``` You can also use the results of an experiment as baseline data for future experiments by calling the `asDataset()`/`as_dataset()` function, which converts the experiment into dataset format (`input`, `expected`, and `metadata`). ```typescript import { init, Eval } from "braintrust"; import { Levenshtein } from "autoevals"; const experiment = init("My App", { experiment: "my-experiment", open: true, }); Eval("My App", { data: experiment.asDataset(), task: async (input) => { return `hello ${input}`; }, scores: [Levenshtein], }); ``` ```python from braintrust import Eval, init from autoevals import Levenshtein experiment = init( project="My App", experiment="my-experiment", open=True, ) Eval( "My App", data=experiment.as_dataset(), task=lambda input: "hello " + input, # Replace with your LLM call scores=[Levenshtein], ) ``` For a more advanced overview of how to use an experiment as a baseline for other experiments, see [hill climbing](/docs/guides/evals/write#hill-climbing). ## Logging from your application To log to a dataset from your application, you can simply use the SDK and call `insert()`. Braintrust logs are queued and sent asynchronously, so you don't need to worry about critical path performance. Since the SDK uses API keys, it's recommended that you log from a privileged environment (e.g. backend server), instead of client applications directly.
This example walks through how to track thumbs-up/thumbs-down user feedback: ```javascript import { initDataset, Dataset } from "braintrust"; class MyApplication { private dataset: Dataset | undefined = undefined; async initApp() { this.dataset = await initDataset("My App", { dataset: "logs" }); } async logUserExample( input: any, expected: any, userId: string, orgId: string, thumbsUp: boolean, ) { if (this.dataset) { this.dataset.insert({ input, expected, metadata: { userId, orgId, thumbsUp }, }); } else { console.warn("Must initialize application before logging"); } } } ``` ```python from typing import Any import braintrust class MyApplication: def __init__(self): self.dataset = None def init_app(self): self.dataset = braintrust.init_dataset(project="My App", name="logs") def log_user_example(self, input: Any, expected: Any, user_id: str, org_id: str, thumbs_up: bool): if self.dataset: self.dataset.insert( input=input, expected=expected, metadata=dict(user_id=user_id, org_id=org_id, thumbs_up=thumbs_up), ) else: print("Must initialize application before logging") ``` ## Troubleshooting ### Downloading large datasets If you are trying to load a very large dataset, you may run into timeout errors while using the SDK. If so, you can [paginate](/docs/guides/api#downloading-a-dataset-using-pagination) through the dataset to download it in smaller chunks. --- file: ./content/docs/guides/human-review.mdx meta: { "title": "Human review" } # Human review Although Braintrust helps you automatically evaluate AI software, human review is a critical part of the process. Braintrust seamlessly integrates human feedback from end users, subject matter experts, and product teams in one place. You can use human review to evaluate/compare experiments, assess the efficacy of your automated scoring methods, and curate log events to use in your evals. As you add human review scores, your logs will update in real time. ![Human review label](./human-review/label.png) ## Configuring human review To set up human review, define the scores you want to collect in your project's **Configuration** tab. ![Human Review Configuration](./human-review/config-page.png) Select **Add human review score** to configure a new score. A score can be one of: * Continuous number value between `0%` and `100%`, with a slider input control. * Categorical value where you can define the possible options and their scores. Categorical value options are also assigned a unique percentage value between `0%` and `100%` (stored as 0 to 1). * Free-form text where you can write a string value to the `metadata` field at a specified path. ![Create modal](./human-review/create-modal.png) Created human review scores will appear in the **Human review** section in every experiment and log trace in the project. Categorical scores configured to "write to expected" and free-form scores will also appear on dataset rows. ### Writing to expected fields You may choose to write categorical scores to the `expected` field of a span instead of a score. To enable this, check the **Write to expected field instead of score** option. There is also an option to **Allow multiple choice** when writing to the expected field. A numeric score will not be assigned to the categorical options when writing to the expected field. If there is an existing object in the expected field, the categorical value will be appended to the object. ![Write to expected](./human-review/expected-fields.png) In addition to categorical scores, you can always directly edit the structured output for the `expected` field of any span through the UI.
## Reviewing logs and experiments To manually review results from your logs or experiments, select a row to open the trace view. There, you can edit the human review scores you previously configured. As you set scores, they will be automatically saved and reflected in the summary metrics. The process is the same whether you're reviewing logs or experiments. ### Leaving comments In addition to setting scores, you can also add comments to spans and update their `expected` values. These updates are tracked alongside score updates to form an audit trail of edits to a span. If you leave a comment that you want to share with a teammate, you can copy a link that will deeplink to the comment. ## Focused review mode If you or a subject matter expert is reviewing a large number of logs or experiments, you can use **Review** mode to enter a UI that's optimized specifically for review. To enter review mode, hit the "r" key or select the expand icon next to the **Human review** header in a span. In review mode, you can set scores, leave comments, and edit expected values. Review mode is optimized for keyboard navigation, so you can quickly move between scores and rows with keyboard shortcuts. You can also share a link to the review mode view with other team members, and they'll drop directly into review mode. ### Reviewing data that matches specific criteria To easily review a subset of your logs or experiments that match given criteria, you can filter using English or [BTQL](/docs/reference/btql#btql-query-syntax), then enter review mode. In addition to filters, you can use [tags](/docs/guides/logging#tags-and-queues) to mark items for `Triage`, and then review them all at once. You can also save any filters, sorts, or column configurations as views. Views give you a standardized place to see any current or future logs that match given criteria, for example, logs with a Factuality score less than 50%. Once you create your view, you can enter review mode right from there. Reviewing is a common task, and therefore you can enter review mode from any experiment or log view. You can also re-enter review mode from any view to audit past reviews or update scores. ### Benefits over an annotation queue * Designed for optimal productivity: The combination of views and human review mode simplifies the review process with intuitive filters, reusable configurations, and keyboard navigation, enabling faster, more efficient log evaluation and feedback. * Dynamic and flexible views: Views dynamically update with new logs matching saved criteria, eliminating the need to set up and maintain complex automation rules. * Easy collaboration: Sharing review mode links allows for team collaboration without requiring intricate permissions or setup overhead. ## Filtering using feedback In the UI, you can filter on log events with specific scores by adding a filter using the filter button, like "Preference is greater than 75%", and then add the matching rows to a dataset for further investigation. You can also programmatically filter log events through the API, using a query and the project ID: ```typescript #skip-compile await braintrust.projects.logs.fetch(projectId, { query }); ``` ```python braintrust.projects.logs.fetch("", "scores.Preference > 0.75") ``` This is a powerful way to utilize human feedback to improve your evals. ## Capturing end-user feedback The same set of updates — scores, comments, and expected values — can be captured from end-users as well.
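For example, a backend endpoint that receives a thumbs-up from your application could record it against the original span along these lines (a minimal sketch, assuming the SDK's `log_feedback` helper covered in the logging guide; the project name, span ID, and score name are placeholders):

```python
import braintrust

# Placeholder project name; use the project your application logs to.
logger = braintrust.init_logger(project="My App")


def record_thumbs_up(request_id: str) -> None:
    # `request_id` is the id of the span that was logged when serving the request.
    # "Preference" is a hypothetical human review score configured for the project.
    logger.log_feedback(
        id=request_id,
        scores={"Preference": 1},
        comment="User clicked thumbs up",
        source="external",
    )
```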
See the [Logging guide](/docs/guides/logs/write#user-feedback) for more details. --- file: ./content/docs/guides/index.mdx meta: { "title": "Guides", "description": "Step-by-step walkthroughs to help you accomplish a specific goal" } # Guides Guides are step-by-step walkthroughs to help you accomplish a specific goal in Braintrust. ## Core functionality ## Features ## Advanced use cases --- file: ./content/docs/guides/monitor.mdx meta: { "title": "Monitor", "metaTitle": "Monitor logs and experiments" } # Monitor page The **Monitor** page shows aggregate metrics for both the logs and experiments in a given project. The included charts show values related to the selected time period for latency, token count, time to first token, cost, request count, and scores. ![Monitor page](/docs/guides/monitor/monitor-basic.png) ## Group by metadata Select the **Group** dropdown menu to group the data by specific metadata fields, including custom fields. ![Monitor page with group by](/docs/guides/monitor/monitor-group-by.png) ## Filter series Select the filter dropdown menu on any individual chart to apply filters. ## Select a timeframe Select a timeframe from the given options to see the data associated with that time period. ## Select to view traces Select a datapoint node in any of the charts to view the corresponding traces for that time period. ![Monitor page click to view traces](/docs/guides/monitor/monitor-click.png) --- file: ./content/docs/guides/playground.mdx meta: { "title": "Playground", "description": "Explore, compare, and evaluate prompts" } # Prompt playground The prompt playground is a tool for exploring, comparing, and evaluating prompts. The playground is deeply integrated within Braintrust, so you can easily try out prompts with data from your [datasets](/docs/guides/datasets). The playground supports a wide range of models, including the latest models from OpenAI, Anthropic, Mistral, Google, Meta, and more, deployed on first- and third-party infrastructure. You can also configure it to talk to your own model endpoints and custom models, as long as they speak the OpenAI, Anthropic, or Google protocol. We're constantly working on improving the playground and adding new features. If you have any feedback or feature requests, please [reach out](/contact) to us. ## Creating a playground The playground organizes your work into sessions. A session is a saved and collaborative workspace that includes one or more prompts and is linked to a dataset. ![Empty Playground](/docs/guides/playground/empty-playground.webp) ### Sharing playgrounds Playgrounds are designed for collaboration and automatically synchronize in real-time. ![Sync Playground](/docs/guides/playground/sync-playground.gif) To share a playground, simply copy the URL and send it to your collaborators. Your collaborators must be members of your organization to see the session. You can invite users from the settings page. Playgrounds can also be shared publicly (read-only). ## Writing prompts Each prompt includes a model (e.g. GPT-4 or Claude-2), a prompt string or messages (depending on the model), and an optional set of parameters (e.g. temperature) to control the model's behavior. When you click "Run" (or use the keyboard shortcut Cmd/Ctrl+Enter), each prompt runs in parallel and the results stream into the grid below. ### Without a dataset By default, a playground is not linked to a dataset, and is self-contained. This is similar to the behavior on other playgrounds (e.g. OpenAI's).
This mode is a useful way to explore and compare self-contained prompts.

### With a dataset

The real power of Braintrust comes from linking a playground to a dataset. You can link to an existing dataset or create a new one from the dataset dropdown:

![Dataset dropdown](/docs/guides/playground/prompt-dataset-dropdown.webp)

Once you link a dataset, you will see a new row in the grid for each record in the dataset. You can reference the data from each record in your prompt using the `input`, `expected`, and `metadata` variables. The playground uses [mustache](https://mustache.github.io/) syntax for templating:

![Prompt with dataset](/docs/guides/playground/prompt-with-dataset.webp)

Each value can be arbitrarily complex JSON, for example:

![Prompt with JSON data](/docs/guides/playground/prompt-with-dataset-json.webp)

### Multimodal prompts

You can also add images to your prompts by selecting the image icon in the input field. Images can be provided as URLs, base64-encoded strings, or variables that contain an image.

![Multimodal prompt](/docs/guides/playground/multimodal-prompt.png)

### Referencing outputs

Each prompt can reference the output of other prompts in the session (e.g. `output_a`). This is useful for validating outputs or chaining prompts together. For example, we can add a grading prompt to the previous example to verify that the output matches an expected value:

![Prompt with output](/docs/guides/playground/prompt-with-output.webp)

### Exporting prompt code

The playground makes it easy to export your prompts as code that you can run through the [AI proxy](/docs/guides/proxy). Select the code icon next to any chat-based prompt to get code snippets in TypeScript, Python, or cURL. The generated code includes the entire prompt configuration, including the model, messages, and any additional parameters you've set.

## Custom models

To configure custom models, see the [Custom models](/docs/guides/proxy#custom-models) section of the proxy docs. Endpoint configurations, like custom models, are automatically picked up by the playground.

---
file: ./content/docs/guides/projects.mdx
meta: { "title": "Projects", "description": "Create and configure projects" }

# Projects

A project is analogous to an AI feature in your application. Projects contain all [experiments](/docs/guides/evals), [logs](/docs/guides/logging), [datasets](/docs/guides/datasets), and [playgrounds](/docs/guides/playground) for the feature. For example, a project might contain:

* An experiment that tests the performance of a new version of a chatbot
* A dataset of customer support conversations
* A prompt that guides the chatbot's responses
* A tool that helps the chatbot answer customer questions
* A scorer that evaluates the chatbot's responses
* Logs that capture the chatbot's interactions with customers

## Project configuration

Projects can also house configuration settings that are shared across the project.

### Tags

Braintrust supports tags that you can use throughout your project to curate logs, datasets, and even experiments. You can filter based on tags in the UI to track various kinds of data across your application and how they change over time.

Tags can be created in the **Configuration** tab by selecting **Add tag** and entering a tag name, selecting a color, and adding an optional description.

For more information about using tags to curate logs, see the [logging guide](/docs/guides/logging#tags-and-queues).
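Tags defined in the configuration can also be attached to events you log from your application. The snippet below is a minimal sketch, not taken from this guide: the project name, event fields, and the `Triage` tag are placeholders, and the exact logging API is covered in the [logging guide](/docs/guides/logging).

```typescript
import { initLogger } from "braintrust";

// Placeholder project name; use the project whose tags you configured.
const logger = initLogger({ projectName: "my-project" });

// Attach a configured tag (here, a hypothetical "Triage" tag) to a logged event
// so it can be filtered and reviewed later.
logger.log({
  input: { question: "How do I reset my password?" },
  output: "Select 'Forgot password' on the sign-in page.",
  tags: ["Triage"],
});
```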
### Human review

You can define scores and labels for manual human review, either as feedback from your users (through the API) or directly through the UI. Scores you define on the **Configuration** page will be available in every experiment and log in your project.

To create a new score, select **Add human review score** and enter a name and score type. You can add multiple options, allow writing to the `expected` field instead of the score, and enable multiple-choice selection.

To learn more about human review, check out the [full guide](/docs/guides/human-review).

### Aggregate scores

Aggregate scores are formulas that combine multiple scores into a single value. This can be useful for creating a single score that represents an experiment's overall performance.

To create an aggregate score, select **Add aggregate score** and enter a name, formula, and description. Braintrust currently supports three types of aggregate scores:

* **Weighted average** - A weighted average of selected scores.
* **Minimum** - The minimum value among the selected scores.
* **Maximum** - The maximum value among the selected scores.

To learn more about aggregate scores, check out the [experiments guide](/docs/guides/evals/interpret#aggregate-weighted-scores).

### Online scoring

Braintrust supports server-side online evaluations that are automatically run asynchronously as you upload logs. To create an online evaluation, select **Add rule** and specify the rule name, description, scorers, and sampling rate you'd like to use.