BIRD-Interact

BIRD-Interact is a text-to-SQL benchmark of 600 deliberately ambiguous business questions across 22 realistic Postgres databases, published at ICLR 2026.

Also known as: BIRD-Interact benchmark · BIRD-Interact ICLR

BIRD-Interact is a text-to-SQL benchmark released by the University of Hong Kong and Google Cloud, accepted as an oral presentation at ICLR 2026 (top ~1-2% of submissions). It is the hardest public benchmark for AI data analysts because it is built around the failure mode most benchmarks ignore: ambiguity.

What’s in it

  • 600 questions across 22 PostgreSQL databases drawn from realistic domains: crypto exchanges, solar panel telemetry, organ transplant records, Hulu streaming metadata.
  • Each question is deliberately ambiguous — “find underperforming assets” when the schema has no “underperforming” column. The system has to either look up a metric definition or ask a clarifying question.
  • Each database ships with a knowledge base of metric definitions and business rules the system can retrieve at query time.
  • Grading uses the official BIRD-Interact evaluation code to check whether the SQL produces the right result.
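To make the grading idea concrete, here is a minimal sketch of execution-based checking: run the generated SQL and a reference query, then compare the rows they return. It is not the official BIRD-Interact evaluation code, which may apply additional checks. The function names, the demo table, and the use of an in-memory SQLite database as a stand-in for PostgreSQL are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute a query and return its rows as a multiset, so row order is ignored."""
    return Counter(conn.execute(sql).fetchall())


def results_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same rows (duplicates counted)."""
    try:
        predicted = run_query(conn, predicted_sql)
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as incorrect
    return predicted == run_query(conn, gold_sql)


def build_demo_db():
    """Hypothetical demo data; SQLite stands in for a benchmark Postgres database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assets (name TEXT, roi REAL)")
    conn.executemany(
        "INSERT INTO assets VALUES (?, ?)",
        [("A", 0.02), ("B", -0.10), ("C", -0.03)],
    )
    return conn
```

Comparing multisets rather than lists means a prediction is not penalized for returning the right rows in a different order. A grader for questions that require a specific ordering would need a stricter comparison.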

The questions look easy until you try to write the SQL. The schemas are nasty — JSON columns containing the real data, inconsistent casing, columns called MARG_FORM next to columns called acctScope. This is what production warehouses actually look like, not the clean academic toy datasets older benchmarks use.
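One reason inconsistent casing hurts: PostgreSQL folds unquoted identifiers to lowercase, so a column created with a mixed-case or upper-case name has to be double-quoted in every query. The sketch below assumes columns such as MARG_FORM and acctScope were created with quoted, case-sensitive names. The helper quote_ident is hypothetical and does not handle reserved keywords.

```python
import re

# Names that match this pattern survive PostgreSQL's lowercase folding unchanged.
_SAFE_IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def quote_ident(name):
    """Double-quote an identifier unless it is already plain lowercase.

    Embedded double quotes are escaped by doubling them, per standard SQL.
    Reserved keywords are not handled in this sketch.
    """
    if _SAFE_IDENT.fullmatch(name):
        return name
    return '"' + name.replace('"', '""') + '"'


def select_columns(table, columns):
    """Build a SELECT statement with each identifier quoted where needed."""
    cols = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {cols} FROM {quote_ident(table)}"
```

For example, select_columns("accounts", ["MARG_FORM", "acctScope", "balance"]) quotes the first two column names and leaves balance and accounts bare.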

What the scores mean

  • 33% — Claude Opus 4.6 used directly, without any grounding layer. This is the headline baseline cited in the BIRD-Interact paper for the frontier-model-alone approach.
  • High 30s to mid 40s — AI features on most existing data tools land in this band. Hex’s Magic AI lands around 44%, Mode’s AI assist around 43%, Julius around 41%, Metabase’s AI assist around 40%, Sigma AI around 39%. Each wraps a frontier model in a thin retrieval layer over schema metadata, which buys a handful of percentage points over the bare model.
  • 75.2% — Datost, on top of the same Claude Opus 4.6 model.

The gap is not the model. It is the system around the model — schema retrieval, metric definition grounding, and clarification before generating SQL.
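As a rough illustration of the grounding and clarification steps, the sketch below resolves a business term before any SQL is written: use a schema column if one matches, fall back to a metric definition from the knowledge base, and otherwise ask the user. This is not Datost's implementation or the benchmark's interface. The Resolution type, the KNOWLEDGE_BASE contents, and resolve_term are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    action: str  # "use_column", "use_definition", or "ask_user"
    detail: str  # column name, metric definition, or clarifying question


# Hypothetical knowledge base mapping business terms to metric definitions.
KNOWLEDGE_BASE = {
    "underperforming": "roi < 0 over the trailing 90 days",
}


def resolve_term(term, schema_columns, knowledge_base=KNOWLEDGE_BASE):
    """Decide how to ground one term before generating SQL."""
    key = term.lower()
    # 1. A column with that name exists: no ambiguity to resolve.
    for column in schema_columns:
        if column.lower() == key:
            return Resolution("use_column", column)
    # 2. The knowledge base defines the term: ground it in that definition.
    if key in knowledge_base:
        return Resolution("use_definition", knowledge_base[key])
    # 3. Nothing matches: ask instead of guessing.
    return Resolution("ask_user", f"How should '{term}' be defined?")
```

The ordering is a design choice: asking the user is the last resort, so questions the knowledge base can already answer do not cost an extra round trip.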

Why this matters for buyers

If a vendor claims their AI data product is accurate but cannot point to a third-party benchmark, the claim is unverifiable. BIRD-Interact (and the broader BIRD family) is the gold standard right now, and it is the right benchmark to ask vendors about during an evaluation.

We wrote up our full BIRD-Interact methodology and results in the benchmark post, including which question families Datost still fails on and why.