Reference

BTQL query syntax

Braintrust Query Language (BTQL) is a precise, SQL-like syntax for querying your experiments, logs, and datasets. You can use BTQL to filter and run more complex queries to analyze your data.

Why use BTQL?

BTQL gives you precise control over your AI application data. You can:

  • Filter and search for relevant logs and experiments
  • Create consistent, reusable queries for monitoring
  • Build automated reporting and analysis pipelines
  • Write complex queries to analyze model performance

Query structure

BTQL queries follow a familiar SQL-like structure that lets you define what data you want and how to analyze it:

select: *                           -- Fields to retrieve
from: project_logs('<PROJECT_ID>')  -- Data source (identifier or function call)
filter: scores.Factuality > 0.8     -- Filter conditions
sort: created desc                  -- Sort order
limit: 100                          -- Result size limit
cursor: '<CURSOR>'                  -- Pagination token

Each clause serves a specific purpose:

  • select: choose which fields to retrieve
  • from: specify the data source - can be an identifier (like project_logs) or a function call (like experiment("id"))
  • filter: define conditions to filter the data
  • sort: set the order of results (asc or desc)
  • limit and cursor: control result size and enable pagination

You can also use dimensions, measures, and pivot instead of select for aggregation queries.
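
For example, a minimal aggregation query (a sketch; the field names are placeholders drawn from the examples later on this page) groups logs by model and averages a score:

dimensions: metadata.model as model
measures: avg(scores.Factuality) as avg_score
from: project_logs('<PROJECT_ID>')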

Understanding traces and spans

When querying experiments or logs, BTQL returns results at the trace level by default. Each row represents a complete interaction with all its associated spans (like prompt construction, model call, post-processing). This helps you understand the full context and flow of each interaction.

For example:

select: *
from: project_logs('my-project')
limit: 10

This will return 10 complete traces, not 10 individual spans.

Available operators

Here are the operators you can use in your queries:

-- Comparison operators
=           -- Equal to (alias for 'eq')
!=          -- Not equal to (alias for 'ne', can also use '<>')
>           -- Greater than (alias for 'gt')
<           -- Less than (alias for 'lt')
>=          -- Greater than or equal (alias for 'ge')
<=          -- Less than or equal (alias for 'le')
 
-- Null operators
IS NULL     -- Check if value is null
IS NOT NULL -- Check if value is not null
ISNULL      -- Unary operator to check if null
ISNOTNULL   -- Unary operator to check if not null
 
-- Text matching
LIKE        -- Case-sensitive pattern matching with SQL wildcards
NOT LIKE    -- Negated case-sensitive pattern matching
ILIKE       -- Case-insensitive pattern matching with SQL wildcards
NOT ILIKE   -- Negated case-insensitive pattern matching
MATCH       -- Full-word text search (fast, but requires exact word matches, e.g. 'apple' won't match 'app')
NOT MATCH   -- Negated full-word text search
 
-- Array operators
INCLUDES    -- Check if array/object contains value (alias: CONTAINS)
NOT INCLUDES -- Check if array/object does not contain value
 
-- Logical operators
AND         -- Both conditions must be true
OR          -- Either condition must be true
NOT         -- Unary operator to negate condition
 
-- Arithmetic operators
+           -- Addition (alias: add)
-           -- Subtraction (alias: sub)
*           -- Multiplication (alias: mul)
/           -- Division (alias: div)
%           -- Modulo (alias: mod)
-x          -- Unary negation (alias: neg)
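
For instance, several of these operators can be combined in a single filter. Here's a sketch using illustrative field names from the examples later on this page:

select: *
from: project_logs('<PROJECT_ID>')
filter:
  metadata.model ILIKE 'gpt-%' and
  tags INCLUDES 'triage' and
  error IS NULL and
  (metrics.prompt_tokens + metrics.completion_tokens) > 1000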

Available functions

Here are all the functions you can use in any context (select, filter, dimensions, measures):

-- Date/time functions
second(timestamp)        -- Extract second from timestamp
minute(timestamp)        -- Extract minute from timestamp
hour(timestamp)          -- Extract hour from timestamp
day(timestamp)           -- Extract day from timestamp
week(timestamp)          -- Extract week from timestamp
month(timestamp)         -- Extract month from timestamp
year(timestamp)          -- Extract year from timestamp
current_timestamp()      -- Get current timestamp (alias: now())
current_date()           -- Get current date
 
-- String functions
lower(text)                       -- Convert text to lowercase
upper(text)                       -- Convert text to uppercase
 
-- Array functions
len(array)                        -- Get length of array
contains(array, value)            -- Check if array contains value (alias: includes)
 
-- Null handling functions
coalesce(val1, val2, ...)        -- Return first non-null value
nullif(val1, val2)               -- Return null if val1 equals val2
least(val1, val2, ...)           -- Return smallest non-null value
greatest(val1, val2, ...)        -- Return largest non-null value
 
-- Numeric functions
round(number, precision)         -- Round to specified precision
 
-- Aggregate functions (only in measures)
count(expr)                      -- Count number of rows
sum(expr)                        -- Sum numeric values
avg(expr)                        -- Calculate mean of numeric values
min(expr)                        -- Find minimum value
max(expr)                        -- Find maximum value
percentile(expr, p)              -- Calculate percentile (p between 0 and 1)
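
As a sketch of how the scalar functions compose in a select (the score and metric names are placeholders):

select:
  coalesce(scores.Factuality, 0) as factuality,                -- default missing scores to 0
  greatest(scores.Factuality, scores.Coherence) as best_score, -- larger of two scores
  round(scores.Factuality, 2) as rounded_score
from: project_logs('<PROJECT_ID>')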

Field access

BTQL provides flexible ways to access nested data in arrays and objects:

-- Object field access
metadata.model             -- Access nested object field
metadata."field name"      -- Access field with spaces
metadata.'field-name'      -- Access field with special characters
 
-- Array access (0-based indexing)
tags[0]                    -- First element
tags[-1]                   -- Last element
 
-- Combined array and object access
metadata.models[0].name    -- Field in first array element
responses[-1].tokens       -- Field in last array element
spans[0].children[-1].id   -- Nested array traversal

Array indices are 0-based, and negative indices count from the end (-1 is the last element).
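
These paths work anywhere an expression is allowed, including filters. A sketch (the field layout is illustrative):

select: metadata.models[0].name as first_model
from: project_logs('<PROJECT_ID>')
filter: responses[-1].tokens > 100 and tags[-1] = 'resolved'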

Select clause

The select clause determines which fields appear in your results. You can select specific fields, compute values, or use * to get everything:

-- Get specific fields
select:
  metadata.model as model,
  scores.Factuality as score,
  created as timestamp
from: project_logs('my-project')

Working with expressions

Transform your data directly in the select clause:

select:
  -- Simple field access
  metadata.model,
 
  -- Computed values
  metrics.tokens > 1000 as is_long_response,
 
  -- Conditional logic
  (scores.Factuality > 0.8 ? "high" : "low") as quality
from: project_logs('my-project')

Using functions

Transform values and create meaningful aliases for your results:

select:
  -- Date/time functions
  day(created) as date,
  hour(created) as hour,
 
  -- Numeric calculations
  round(scores.Factuality, 2) as rounded_score
from: project_logs('my-project')

Dimensions and measures

Instead of select, you can use dimensions and measures to group and aggregate data:

-- Analyze model performance over time
dimensions:
  metadata.model as model,
  day(created) as date
measures:
  count(1) as total_calls,
  avg(scores.Factuality) as avg_score,
  percentile(latency, 0.95) as p95_latency
from: project_logs('my-project')

Aggregate functions

Common aggregate functions for measures:

-- Example using various aggregates
dimensions: metadata.model as model
measures:
  count(1) as total_rows,                -- Count rows
  sum(metrics.tokens) as total_tokens,   -- Sum values
  avg(scores.Factuality) as avg_score,   -- Calculate mean
  min(latency) as min_latency,           -- Find minimum
  max(latency) as max_latency,           -- Find maximum
  percentile(latency, 0.95) as p95       -- Calculate percentiles
from: project_logs('my-project')

Pivot results

The pivot clause transforms your results to make comparisons easier by converting rows into columns. This is especially useful when comparing metrics across different categories or time periods.

Syntax:

pivot: <measure1>, <measure2>, ...

Here are some examples:

-- Compare model performance metrics across models
dimensions:
  metadata.model as model,
  day(created) as date
measures:
  avg(scores.Factuality) as avg_factuality,
  avg(metrics.tokens) as avg_tokens,
  count(1) as call_count
from: project_logs('my-project')
pivot: avg_factuality, avg_tokens, call_count
 
-- Results will look like:
-- {
--   "date": "2024-01-01",
--   "gpt-4_avg_factuality": 0.92,
--   "gpt-4_avg_tokens": 150,
--   "gpt-4_call_count": 1000,
--   "gpt-3.5-turbo_avg_factuality": 0.85,
--   "gpt-3.5-turbo_avg_tokens": 120,
--   "gpt-3.5-turbo_call_count": 2000
-- }
 
-- Compare metrics across time periods
dimensions: metadata.model as model
measures:
  avg(scores.Factuality) as avg_score,
  percentile(latency, 0.95) as p95_latency
from: project_logs('my-project')
pivot: avg_score, p95_latency
 
-- Results will look like:
-- {
--   "model": "gpt-4",
--   "0_avg_score": 0.91,
--   "0_p95_latency": 2.5,
--   "1_avg_score": 0.89,
--   "1_p95_latency": 2.8,
--   ...
-- }
 
-- Compare tag distributions across models
dimensions: tags[0] as primary_tag
measures: count(1) as tag_count
from: project_logs('my-project')
pivot: tag_count
 
-- Results will look like:
-- {
--   "primary_tag": "quality",
--   "gpt-4_tag_count": 500,
--   "gpt-3.5-turbo_tag_count": 300
-- }

Pivot columns are automatically named by combining the dimension value and the measure name. For example, if you pivot the measure avg_score and the metadata.model dimension takes the value "gpt-4", the resulting column is named gpt-4_avg_score.

Unpivot

The unpivot clause transforms columns into rows, which is useful when you need to analyze arbitrary scores and metrics without specifying each score name. This is particularly helpful when working with dynamic sets of metrics, or when you don't know all the possible score names in advance.

-- Convert wide format to long format for arbitrary scores
dimensions: created as date
measures: count(1) as count
from: project_logs('my-project')
unpivot: count as (score_name, score_value)
 
-- Results will look like:
-- {
--   "date": "2024-01-01",
--   "score_name": "Factuality",
--   "score_value": 0.92
-- },
-- {
--   "date": "2024-01-01",
--   "score_name": "Coherence",
--   "score_value": 0.88
-- }

Conditional expressions

BTQL supports conditional logic using the ternary operator (? :):

-- Basic conditions
select:
  (scores.Factuality > 0.8 ? "high" : "low") as quality,
  (error IS NOT NULL ? -1 : metrics.tokens) as valid_tokens
from: project_logs('my-project')
 
-- Nested conditions
select:
  (scores.Factuality > 0.9 ? "excellent" :
   scores.Factuality > 0.7 ? "good" :
   scores.Factuality > 0.5 ? "fair" : "poor") as rating
from: project_logs('my-project')
 
-- Use in calculations
select:
  (metadata.model = "gpt-4" ? metrics.tokens * 2 : metrics.tokens) as adjusted_tokens,
  (error IS NULL ? metrics.latency : 0) as valid_latency
from: project_logs('my-project')

Time intervals

BTQL supports intervals for time-based operations:

-- Basic intervals
select: *
from: project_logs('my-project')
filter: created > now() - interval 1 day
 
-- Multiple time conditions
select: *
from: project_logs('my-project')
filter:
  created > now() - interval 1 hour and
  created < now()
 
-- Examples with different units
select: *
from: project_logs('my-project')
filter:
  created > now() - interval 7 day and    -- Last week
  created > now() - interval 1 month      -- Last month

Filter clause

The filter clause lets you specify conditions to narrow down results. It supports a wide range of operators and functions:

filter:
  -- Simple comparisons
  scores.Factuality > 0.8 and
  metadata.model = "gpt-4" and
 
  -- Array operations
  tags includes "triage" and
 
  -- Text search
  input ILIKE '%question%' and
 
  -- Date ranges
  created > '2024-01-01' and
 
  -- Complex conditions
  (
    metrics.tokens > 1000 or
    metadata.is_production = true
  )

Note: Negative filters on tags (e.g., NOT tags includes "resolved") may not work as expected. Since tags are only applied to the root span of a trace, and queries return complete traces, negative tag filters will match child spans (which don't have tags) and return the entire trace. We recommend using positive tag filters instead.
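
For example, rather than excluding resolved traces, filter on a tag that is present on the traces you want (a sketch, assuming a 'needs-review' tag is applied when an issue is opened):

-- May match child spans and return unexpected traces:
filter: NOT tags includes 'resolved'
 
-- Recommended: positive tag filter
filter: tags includes 'needs-review'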

Sort clause

The sort clause determines the order of results:

-- Sort by single field
sort: created desc
 
-- Sort by multiple fields
sort: scores.Factuality desc, created asc
 
-- Sort by computed values
sort: len(tags) desc

Limit and cursor

Control result size and implement pagination:

-- Basic limit
limit: 100
 
-- Pagination using cursor (only works without sort)
select: *
from: project_logs('<PROJECT_ID>')
limit: 100
cursor: '<CURSOR_TOKEN>'  -- From previous query response

Cursors are automatically returned in the response when you use limit. However, cursors can only be used for pagination when no sort clause is specified. If you need sorted results, paginate by filtering on the last value of your sort field from the previous page, as shown in the example below.

-- Filter-based pagination with sorting
-- Page 1 (first 100 results)
select: *
from: project_logs('<PROJECT_ID>')
sort: created desc
limit: 100
 
-- Page 2 (next 100 results)
select: *
from: project_logs('<PROJECT_ID>')
filter: created < '2024-01-15T10:30:00Z'  -- Last created timestamp from previous page
sort: created desc
limit: 100

API access

Access BTQL programmatically through our API:

curl -X POST https://api.braintrust.dev/btql \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"query": "select: * | from: project_logs('"'<YOUR_PROJECT_ID>'"') | filter: tags includes '"'triage'"'"}'

The API accepts these parameters:

  • query (required): your BTQL query string
  • fmt: response format (json or parquet, defaults to json)
  • tz_offset: timezone offset in minutes for time-based operations
  • use_columnstore: enable columnstore for faster large queries
  • audit_log: include audit log data

For correct day boundaries, set tz_offset to match your timezone. For example, use 480 for US Pacific Standard Time.
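
For instance, these parameters can be combined in a single request body (a sketch; substitute your own query and timezone offset):

curl -X POST https://api.braintrust.dev/btql \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"query": "select: * | from: project_logs('"'<YOUR_PROJECT_ID>'"') | limit: 10", "fmt": "json", "tz_offset": 480}'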

Examples

Let's look at some real-world examples:

Tracking token usage

This query helps you monitor token consumption across your application:

from: project_logs('<YOUR_PROJECT_ID>')
filter: created > '<ISO_8601_TIME>'
dimensions: day(created) as time
measures:
  sum(metrics.total_tokens) as total_tokens,
  sum(metrics.prompt_tokens) as input_tokens,
  sum(metrics.completion_tokens) as output_tokens
sort: time asc

The response shows daily token usage:

{
  "time": "2024-11-09T00:00:00Z",
  "total_tokens": 100000,
  "input_tokens": 50000,
  "output_tokens": 50000
}

Model quality monitoring

Track model performance across different versions and configurations:

-- Compare factuality scores across models
dimensions:
  metadata.model as model,
  day(created) as date
measures:
  avg(scores.Factuality) as avg_factuality,
  percentile(scores.Factuality, 0.05) as p05_factuality,
  percentile(scores.Factuality, 0.95) as p95_factuality,
  count(1) as total_calls
from: project_logs('<PROJECT_ID>')
filter: created > '2024-01-01'
sort: date desc, model asc
 
-- Find potentially problematic responses
select: *
from: project_logs('<PROJECT_ID>')
filter:
  scores.Factuality < 0.5 and
  metadata.is_production = true and
  created > now() - interval 1 day
sort: scores.Factuality asc
limit: 100

Error analysis

Identify and investigate errors in your application:

-- Error rate by model
dimensions:
  metadata.model as model,
  hour(created) as hour
measures:
  count(1) as total,
  sum(error IS NOT NULL ? 1 : 0) as errors,
  sum(error IS NOT NULL ? 1 : 0) / count(1) as error_rate
from: project_logs('<PROJECT_ID>')
filter: created > now() - interval 1 day
sort: error_rate desc
 
-- Find common error patterns
dimensions:
  error.type as error_type,
  metadata.model as model
measures:
  count(1) as error_count,
  avg(metrics.latency) as avg_latency
from: project_logs('<PROJECT_ID>')
filter:
  error IS NOT NULL and
  created > now() - interval 7 day
sort: error_count desc

Latency analysis

Monitor and optimize response times:

-- Track p95 latency by endpoint
dimensions:
  metadata.endpoint as endpoint,
  hour(created) as hour
measures:
  percentile(metrics.latency, 0.95) as p95_latency,
  percentile(metrics.latency, 0.50) as median_latency,
  count(1) as requests
from: project_logs('<PROJECT_ID>')
filter: created > now() - interval 1 day
sort: hour desc, p95_latency desc
 
-- Find slow requests
select:
  metadata.endpoint,
  metrics.latency,
  metrics.tokens,
  input,
  created
from: project_logs('<PROJECT_ID>')
filter:
  metrics.latency > 5000 and  -- Requests over 5 seconds
  created > now() - interval 1 hour
sort: metrics.latency desc
limit: 20

Prompt analysis

Analyze prompt effectiveness and patterns:

-- Track prompt token efficiency
dimensions:
  metadata.prompt_template as template,
  day(created) as date
measures:
  avg(metrics.prompt_tokens) as avg_prompt_tokens,
  avg(metrics.completion_tokens) as avg_completion_tokens,
  avg(metrics.completion_tokens) / avg(metrics.prompt_tokens) as token_efficiency,
  avg(scores.Factuality) as avg_factuality
from: project_logs('<PROJECT_ID>')
filter: created > now() - interval 7 day
sort: date desc, token_efficiency desc
 
-- Find similar prompts
select: *
from: project_logs('<PROJECT_ID>')
filter:
  input MATCH 'explain the concept of recursion' and
  scores.Factuality > 0.8
sort: created desc
limit: 10

Tag-based analysis

Use tags to track and analyze specific behaviors:

-- Monitor feedback patterns
dimensions:
  tags[0] as primary_tag,
  metadata.model as model
measures:
  count(1) as feedback_count,
  avg(scores.Factuality > 0.8 ? 1 : 0) as high_quality_rate
from: project_logs('<PROJECT_ID>')
filter:
  tags includes 'feedback' and
  created > now() - interval 30 day
sort: feedback_count desc
 
-- Track issue resolution
select:
  created,
  tags,
  metadata.model,
  scores.Factuality,
  response
from: project_logs('<PROJECT_ID>')
filter:
  tags includes 'needs-review' and
  NOT tags includes 'resolved' and  -- see the note in the filter clause section about negative tag filters
  created > now() - interval 1 day
sort: scores.Factuality asc