BIRD-Interact
BIRD-Interact is a text-to-SQL benchmark of 600 deliberately ambiguous business questions across 22 realistic Postgres databases, published at ICLR 2026.
Also known as: BIRD-Interact benchmark · BIRD-Interact ICLR
BIRD-Interact is a text-to-SQL benchmark released by the University of Hong Kong and Google Cloud, accepted as an oral presentation at ICLR 2026 (top ~1-2% of submissions). It is the hardest public benchmark for AI data analysts because it is built around the failure mode most benchmarks ignore: ambiguity.
What’s in it
- 600 questions across 22 PostgreSQL databases drawn from realistic domains: crypto exchanges, solar panel telemetry, organ transplant records, Hulu streaming metadata.
- Each question is deliberately ambiguous — “find underperforming assets” when the schema has no “underperforming” column. The system has to either look up a metric definition or ask a clarifying question.
- Each database ships with a knowledge base of metric definitions and business rules the system can retrieve at query time.
- Grading uses the official BIRD-Interact evaluation code to check whether the SQL produces the right result.
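To make the grading idea concrete, here is a minimal sketch of result-based checking: run the predicted SQL and a reference SQL against the same database and compare what comes back. This is not the official BIRD-Interact evaluation code. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `assets` table, its `roi` column, the 0.05 threshold, and the function names are all hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset, so row order is ignored."""
    return Counter(conn.execute(sql).fetchall())


def execution_match(conn, predicted_sql, reference_sql):
    """Return True when both queries produce the same rows (in any order)."""
    try:
        predicted = run_query(conn, predicted_sql)
    except sqlite3.Error:
        # A query that fails to execute cannot match the reference.
        return False
    return predicted == run_query(conn, reference_sql)


# Hypothetical toy data; the real benchmark runs against PostgreSQL databases.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assets (name TEXT, roi REAL);
    INSERT INTO assets VALUES ('A', 0.02), ('B', 0.11), ('C', -0.04);
    """
)

# Assumed reference reading of "underperforming": roi below 0.05.
reference = "SELECT name FROM assets WHERE roi < 0.05"
# Different SQL text, same result set.
good = "SELECT a.name FROM assets AS a WHERE a.roi < 0.05 ORDER BY a.name DESC"
# Plausible but wrong reading of "underperforming": negative roi only.
bad = "SELECT name FROM assets WHERE roi < 0"

print(execution_match(conn, good, reference))
print(execution_match(conn, bad, reference))
```

The point of comparing results instead of SQL text is that two differently written queries count as equally correct when they return the same rows, while a query built on the wrong reading of an ambiguous term fails even though it runs cleanly.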
The questions look easy until you try to write the SQL. The schemas are nasty — JSON columns containing the real data, inconsistent casing, columns called MARG_FORM next to columns called acctScope. This is what production warehouses actually look like, not the clean academic toy datasets older benchmarks use.
What the scores mean
- 33% — Claude Opus 4.6 used directly, without any grounding layer. This is the headline baseline cited in the BIRD-Interact paper for the frontier-model-alone approach.
- High 30s to mid 40s — AI features on most existing data tools land in this band. Hex’s Magic AI lands around 44%, Mode’s AI assist around 43%, Julius around 41%, Metabase’s AI assist around 40%, Sigma AI around 39%. Each wraps a frontier model in a thin retrieval layer over schema metadata, which buys a handful of percentage points over the bare model.
- 75.2% — Datost, on top of the same Claude Opus 4.6 model.
The gap is not the model. It is the system around the model — schema retrieval, metric definition grounding, and clarification before generating SQL.
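As a rough illustration of what "grounding and clarification before generating SQL" can mean, the sketch below resolves a business term against the schema first, then against a knowledge base of metric definitions, and falls back to a clarifying question when neither matches. It is a toy under stated assumptions, not Datost's implementation and not part of BIRD-Interact: the `Resolution` type, the `resolve_term` function, and the knowledge base contents are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    action: str  # "ground" when the term is resolved, "clarify" otherwise
    detail: str  # column name, metric definition, or the question to ask


# Hypothetical knowledge base: business term -> SQL predicate that defines it.
KNOWLEDGE_BASE = {"underperforming": "roi < 0.05"}


def resolve_term(term, schema_columns, knowledge_base):
    """Decide whether a term can be grounded or needs a clarifying question."""
    key = term.strip().lower()

    # Match columns case-insensitively, since casing in the schema is inconsistent.
    columns_by_lower = {column.lower(): column for column in schema_columns}
    if key in columns_by_lower:
        return Resolution("ground", columns_by_lower[key])

    # Otherwise look for a documented metric definition.
    if key in knowledge_base:
        return Resolution("ground", knowledge_base[key])

    # Nothing to ground on: ask instead of guessing.
    return Resolution("clarify", f"How should '{term}' be defined?")


columns = ["MARG_FORM", "acctScope"]
print(resolve_term("underperforming", columns, KNOWLEDGE_BASE))
print(resolve_term("acctscope", columns, KNOWLEDGE_BASE))
print(resolve_term("risky", columns, KNOWLEDGE_BASE))
```

A real system would use retrieval over the knowledge base instead of exact key lookup, but the ordering is the design choice worth noting: SQL is only generated once every ambiguous term has either a definition or an answer from the user.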
Why this matters for buyers
If a vendor claims their AI data product is accurate but cannot point to a third-party benchmark, the claim is unverifiable. BIRD-Interact (and the broader BIRD family) is the gold standard right now, which makes it the right benchmark to ask vendors about during evaluation.
We wrote up our full BIRD-Interact methodology and results in the benchmark post, including which question families Datost still fails on and why.
- Text-to-SQL: Text-to-SQL is the task of translating a natural-language question into a SQL query that runs against a database and returns the answer.
- Semantic Layer: A semantic layer is a central definition store that maps human-readable business concepts (revenue, churn, MRR) to the underlying tables and SQL that compute them.
- Metric Definition: A metric definition is the exact SQL or calculation that produces a business metric, plus the documented assumptions behind it.