Self-hosting Braintrust
Braintrust supports self-hosting through its unique hybrid architecture. In this guide, we outline important considerations and trade-offs for the various self-hosting options, along with how we can most effectively support you.
Overview
In general, there are two ways to self-host Braintrust:
- Using our official infrastructure packages (Terraform, CloudFormation [1])
- Using our Docker containers
We strongly recommend using our Terraform module because it's kept up-to-date with best practices, mirrors our fully hosted offering (proven at scale), and minimizes configuration issues. This is important because when troubleshooting performance or operational issues, we first need to understand what unique characteristics of your deployment might be contributing factors. If you use our Terraform modules, entire classes of issues simply cannot occur. For example, an API instance that uses too many resources will not cause downstream issues, because API requests run on AWS Lambda and are isolated from one another.
A common piece of feedback we hear is that the Terraform module uses practices that differ from your specific infrastructure. While that may be true, we’d ask that you consider two points before rejecting it:
- A common practice is to run Braintrust inside a separate AWS account, so that it does not clutter or otherwise affect your main account. In many cases, this is enough to alleviate concerns about its design.
- If there are small tweaks that would make it work well in your infrastructure, it is significantly easier to fork (and even upstream) changes to the Terraform than it is to reinvent the infra yourself from scratch.
If you choose not to use our Terraform module, we will still support you, but you are now equally responsible for the uptime, security, and performance of your deployment as we are. Expect to put in significantly more work to keep your system running well.
Azure, Google Cloud Platform (GCP)
We now have a Terraform template available for Azure and plan to build one for GCP. If this is of interest to you, please reach out.
Kubernetes (k8s)
Our AWS and Azure Terraform submodules can be used in combination with existing k8s clusters as an option for hosting compute services. We provide a Helm chart for deploying our core services in k8s and submodules for deploying data stores such as PostgreSQL and Redis. We plan to support k8s for the long term. In AWS, however, we still do not recommend using k8s, because you must handle autoscaling yourself rather than relying on the built-in scaling of Lambda functions.
Roles and responsibilities
When you self-host, uptime becomes a shared responsibility between your team and ours. It is our responsibility to respond quickly when you have issues, collaboratively resolve them with you, and fix bugs and improve quality so that you encounter fewer issues in the future. It is your responsibility to follow our documentation, assign infrastructure resources on your team, and make sure that in the event of an incident, you have staff who are familiar with Braintrust and can work with our team to share context and resolve issues.
If you use our Terraform, we can help you resolve issues much more efficiently, since we can rule out infrastructure configuration as a root cause and assume you have deployed with best practices. Our Terraform template also has built-in support for enabling temporary, secure remote access for our staff.
If you use our Docker containers, you should be prepared to:
- Collect logs and make them available to us upon request
- Make sure one or more people on your team can access the Docker containers, database (via `psql`), and storage buckets to perform ad-hoc checks and updates.
- Ensure your on-call resources are familiar with Braintrust.
- If possible, give us temporary remote access when required to resolve complex issues. We understand this is not always possible, but if not, it’s important to make sure that someone on your team can access the system.
Our product and the AI space as a whole are evolving quickly, so neither is perfect and issues will happen. It’s therefore important that, as a shared team, we make sure that you and your users are set up for success. Review this section and make sure that you can commit to what is required for the deployment option you pick. If you do not feel you have the resources or bandwidth to commit to what’s required, then pick an easier option (e.g. Terraform rather than Docker, or fully hosted rather than Terraform).
Here’s a quick summary of expectations with each deployment option. There are more details in the sections that follow.
| | Terraform | Docker |
|---|---|---|
| Configuration | Braintrust is responsible for making sure that your system is configured correctly. The Terraform module sizes things to operate at fairly high scale; however, there may be updates we need to make for the future. | You are responsible for reading the Docker documentation and configuring each container. You are also responsible for configuring the network, database, cache, and buckets. You should read the Terraform definitions to pick up on specific details (e.g. configuring NVMe disk correctly for Brainstore). You should periodically check them for updates and reconcile those with your cluster. |
| Updates | You can run `terraform apply` to update your cluster. | You are responsible for updating the containers periodically. We publish `latest` tags as well as versions (e.g. `v1.4.0`) on a ~weekly basis. Certain features will require infrastructure updates, so you should also keep an eye on the Terraform modules to see if you need to make those. |
| Monitoring | You can enable a flag that shares your logs with us via CloudWatch. This allows us to automatically root-cause errors when you hit them, since we can just look at your logs. You can also opt into sending telemetry to our Datadog instance, and we can monitor the uptime of your system for you. | When you run into issues (e.g. internal server errors), you’ll have to find them in your logs, and either share the log messages with us or try to diagnose them yourself. You will also need to configure your own monitoring/alerting. We can provide guidance on what to look for to know something is wrong. We are currently working on a new revision to how we monitor all clusters (see below). |
Monitoring
We are currently working on a new iteration of our monitoring stack to better support self-hosted customers, both those who use our Terraform modules and those who run our Docker containers directly. This section describes how the planned feature set will work. The ETA for these features is July 14, 2025.
We have learned some important lessons from customers at varying scale:
- There are a handful of Braintrust-specific metrics (like Brainstore indexing lag) that tell us whether the system is struggling.
- Generally, when customers find issues, they immediately notify us and we work together on gathering context from their logs. The time between identifying an issue and getting us the relevant bits of information is usually a major contributor to downtime.
- Many issues, if identified proactively or quickly, can be resolved with little or no downtime. Ideally we can issue alerts as soon as we detect them.
To address these in the most efficient way possible, we are updating self-hosted Braintrust (Terraform, CloudFormation, and Docker) to start automatically sending telemetry back to our control plane. Each service now supports the following:
- Sending metrics, logs, and traces to Braintrust’s control plane. These requests are automatically authorized using your license key and therefore tied to your account. We will use this information to monitor the health of your deployment for you.
- While we are careful not to include any PII in logs or traces, we understand that you may not feel comfortable sending them to us. Each type of telemetry (metrics, logs, and traces) can be individually enabled or disabled.
- Metrics and traces can be sent to an OpenTelemetry destination of your choice (via HTTP). You can collect logs directly from the Lambda functions and Docker containers. We can provide high-level guidance on what to look for, but we’re not planning to build integrations into specific observability tools for you to monitor the system.
In general, our approach will be to send ourselves enough information to (a) proactively notify ourselves and you when something is going wrong and (b) have the information required to diagnose issues without asking you to dig it up for us. We also plan to explore visualizing system health in an admin dashboard directly in our UI.
Upgrades
We release new versions of the data plane around once per week, often with incremental changes that improve the performance of Brainstore, add support for new UI features, and improve logging. You do not need to update this often, but here is a framework for how often you should update:
- Generally speaking, customers update about once per month
- You must update at least once per quarter
- If you are collaborating closely with us, e.g. on improving the performance of a query, you may need to update more often (each time we release a new version). If so, we’ll be in close contact with you about updating.
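The cadence above can be turned into a simple reminder check. The following is a minimal sketch, not an official tool: the function name `upgrade_status` and the 30-day and 90-day thresholds are assumptions that approximate "about once per month" and "at least once per quarter".

```python
from datetime import date

# Assumed thresholds: 30 and 90 days approximate "about once per month"
# and "at least once per quarter" from the guidance above.
TYPICAL_INTERVAL_DAYS = 30
MAX_INTERVAL_DAYS = 90


def upgrade_status(last_upgrade: date, today: date) -> str:
    """Classify how stale a deployment is relative to the suggested cadence."""
    age_days = (today - last_upgrade).days
    if age_days < 0:
        raise ValueError("last_upgrade is in the future")
    if age_days > MAX_INTERVAL_DAYS:
        return "overdue"  # past the quarterly minimum
    if age_days > TYPICAL_INTERVAL_DAYS:
        return "due"      # older than the typical monthly cadence
    return "current"
```

For example, `upgrade_status(date(2025, 1, 1), date(2025, 5, 1))` covers 120 days and returns `"overdue"`. A check like this could run in a scheduled job that notifies your on-call channel.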
While upgrading, if you use our built-in Terraform modules, you simply need to run `terraform apply`. This will make any relevant infra changes, as well as update the versions of Braintrust’s code. If you are deploying via Docker, then you should:
- Make sure to update both the API and Brainstore services at the same time (they do not need to be exactly synchronized, but should be updated on the same cadence).
- Periodically review the Terraform template and docs to make sure you are following best practices and have all of the necessary infrastructure components in place.
Remote access
There are occasionally issues that will require ad-hoc debugging or running manual commands against the container, Postgres database, or storage buckets to repair the state of the system. Customers who give us remote access (as needed) have experienced much faster resolutions when such issues occur, because our team can connect directly and resolve things. We understand that this is not always possible, but if not, we kindly ask you to factor this into your uptime calculations. Said another way, if uptime of Braintrust is a key metric for you, then you should strongly consider making remote access, as needed, available to our team.
If you cannot set up remote access, then make sure that you can swiftly spin up a terminal that can perform the following:
- Access the containers directly (`docker exec`, update them, view logs, restart them, view host metrics like CPU, network, memory, and disk utilization)
- Run SQL queries against Postgres
- Connect to Redis
- Run read, write, and list commands against your storage buckets
It’s important that your on-call staff have basic familiarity with Braintrust and the ability to perform all of these operations.
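One way to confirm this access ahead of an incident is a small preflight script that runs one read-only command per capability. The sketch below is illustrative only and makes several assumptions: the `docker`, `psql`, `redis-cli`, and `aws` CLIs are installed on the machine, your buckets live in S3, and the variable names `DATABASE_URL`, `REDIS_URL`, and `BUCKET_URI` are hypothetical placeholders for your own connection settings. It only verifies list access to the bucket; read and write access would need separate checks.

```python
import subprocess
from typing import Dict, List, Mapping


def default_checks(env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Build one read-only command per capability in the list above.

    DATABASE_URL, REDIS_URL, and BUCKET_URI are hypothetical names;
    a missing key raises KeyError so misconfiguration is caught early.
    """
    return {
        "containers": ["docker", "ps"],
        "postgres": ["psql", "--dbname", env["DATABASE_URL"],
                     "--command", "SELECT 1"],
        "redis": ["redis-cli", "-u", env["REDIS_URL"], "ping"],
        # Assumes S3; this covers list access only.
        "bucket": ["aws", "s3", "ls", env["BUCKET_URI"]],
    }


def run_checks(checks: Mapping[str, List[str]],
               timeout_s: float = 10.0) -> Dict[str, bool]:
    """Run each command and report True when it exits with status 0."""
    results: Dict[str, bool] = {}
    for name, argv in checks.items():
        try:
            done = subprocess.run(argv, capture_output=True, timeout=timeout_s)
            results[name] = done.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Tool not installed, not executable, or the command hung.
            results[name] = False
    return results
```

Calling `run_checks(default_checks(os.environ))` (after `import os`) returns a dictionary such as `{"containers": True, ...}`. Any `False` entry marks a capability your on-call staff should fix before they need it.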
[1] CloudFormation was our original infrastructure option and is not recommended for new deployments, but will continue to be supported for existing users. All of the trade-offs and considerations that apply to Terraform apply to CloudFormation as well.