How Accurate Is Text-to-SQL, Really? Spider, BIRD, and the Enterprise Cliff
An honest, benchmark-driven answer to whether AI text-to-SQL can be trusted. We walk from Spider to BIRD to BIRD-Interact, explain the enterprise cliff, and show the head-to-head scores.
Text-to-SQL is accurate enough to trust on clean academic schemas, and far less accurate than the headlines suggest on real enterprise warehouses. On the Spider benchmark, frontier systems hit roughly 85% execution accuracy. On BIRD, which uses dirtier databases, top systems land around 75 to 82% against a human bar near 93%. On BIRD-Interact, the hardest public benchmark, a frontier model used alone scores about 33%. That last number is the one to care about, because your warehouse is ambiguous, messy, and full of business context the model has never seen. Call it the enterprise cliff. The distance between a benchmark question and a real one is exactly where text-to-SQL falls down, and the fall is steep.
The three benchmarks, in order of difficulty
An accuracy claim only means something relative to a benchmark. Three of them matter, and each one gets harder for a specific reason.
Spider: clean schemas, unambiguous questions
Spider is the original large-scale text-to-SQL benchmark, hand-labeled by Yale students. It has 10,181 questions over 200 databases spanning 138 domains, and it measures execution accuracy (run the predicted SQL, run the gold SQL, compare results).
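To make the metric concrete, here is a minimal sketch of execution accuracy using Python's built-in sqlite3 module. The toy singer table, the query pairs, and the helper names are invented for illustration, and the official Spider evaluation scripts handle more cases than this simplified comparison does.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Execute a query and return its rows as an order-insensitive multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None  # SQL that fails to execute counts as wrong


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result sets match."""
    hits = 0
    for predicted, gold in pairs:
        gold_rows = run(conn, gold)
        if gold_rows is not None and run(conn, predicted) == gold_rows:
            hits += 1
    return hits / len(pairs)


# Toy database, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO singer (name, country) VALUES ('Ana', 'PT'), ('Bo', 'SE'), ('Cy', 'SE');
    """
)

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT COUNT(id) FROM singer", "SELECT COUNT(*) FROM singer"),
    # Runs fine but answers a different question: counted as wrong.
    ("SELECT COUNT(DISTINCT country) FROM singer", "SELECT COUNT(*) FROM singer"),
    # Does not execute: counted as wrong.
    ("SELECT COUNT(*) FROM singers", "SELECT COUNT(*) FROM singer"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2f}")
```

The point of comparing results rather than SQL text is that two differently written queries can both be right, and a query that runs cleanly can still be wrong.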
Spider is hard in an academic sense: cross-domain generalization, nested queries, multi-table joins. But the schemas are clean and the questions are unambiguous. “How many singers are there?” maps to one obvious table, and the progress curve shows it. Execution accuracy climbed from about 53.5% in 2020 to roughly 85% by 2023, once GPT-4-class models and decomposition methods like DIN-SQL arrived (DIN-SQL reported 85.3% on the Spider test set). Spider is close to solved.
BIRD: dirty data, external knowledge
BIRD raised the bar by grounding the task in messier, larger databases. It has 12,751 question-SQL pairs over 95 databases across 37 professional domains, and it deliberately adds the friction Spider lacked: dirty values, the need for external knowledge to bridge a question and the schema, and a scoring component for query efficiency (the Valid Efficiency Score).
The gap shows up immediately. In the original BIRD paper, vanilla ChatGPT scored about 40% execution accuracy against a 92.96% human expert bar. As of 2026 the best pipelines reach the low 80s on the test set (the leading system, AskData with GPT-4o, reports 81.95%), and they get there with elaborate multi-step orchestration rather than raw model capability. The lesson from BIRD is that real data, not query complexity, is what breaks text-to-SQL.
BIRD-Interact: ambiguity on purpose
BIRD-Interact, created by the BIRD Team and Google Cloud (with the University of Hong Kong) and accepted as an oral presentation at ICLR 2026, is the hardest public benchmark because it targets the failure mode the others ignore: ambiguity. It has 600 deliberately ambiguous business questions across 22 realistic Postgres databases, covering things like crypto exchanges, solar telemetry, organ transplant records, and streaming metadata.
A question like “find the underperforming assets” has no underperforming column. The system has to either retrieve a metric definition or ask a clarifying question, and it gets scored on doing the right thing when the question is underspecified, not just on emitting SQL. That one design choice is why the scores fall off a cliff. We published our full methodology and results, including the question families we still fail.
The enterprise cliff: why scores collapse
Your warehouse looks nothing like Spider and everything like BIRD-Interact. Three things drag the score down the moment text-to-SQL leaves the lab.
The first is ambiguity. Real questions are underspecified. “How’s revenue this quarter?” leaves out whether revenue means booked, recognized, or net of refunds, and whether “this quarter” is calendar or fiscal. A bare model resolves that silently by guessing, and the guess is plausible enough that nobody catches it. The right move is to ask. Most systems never do. See text-to-SQL for why this is the step that fails.
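The ask-instead-of-guess behavior can be sketched in a few lines. This is an illustration only: the AMBIGUOUS_TERMS catalog, the function names, and the keyword matching are all assumptions, and a production system would detect ambiguity with something far richer than substring checks.

```python
from typing import Dict, List, Optional

# Hypothetical catalog of terms that have more than one valid business meaning.
AMBIGUOUS_TERMS = {
    "revenue": ["booked", "recognized", "net of refunds"],
    "this quarter": ["calendar quarter", "fiscal quarter"],
}


def clarifying_questions(question: str, resolved: Optional[Dict[str, str]] = None) -> List[str]:
    """Return one question per ambiguous term the user has not pinned down yet."""
    resolved = resolved or {}
    text = question.lower()
    questions = []
    for term, meanings in AMBIGUOUS_TERMS.items():
        if term in text and term not in resolved:
            questions.append(f"By '{term}', do you mean: {', '.join(meanings)}?")
    return questions


def answer_or_ask(question: str, resolved: Optional[Dict[str, str]] = None) -> dict:
    """Ask when anything is underspecified; move on to SQL generation only when nothing is."""
    pending = clarifying_questions(question, resolved)
    if pending:
        return {"action": "ask", "questions": pending}
    return {"action": "generate_sql", "definitions": resolved or {}}


# First turn: both terms are open, so the system asks.
first = answer_or_ask("How is revenue this quarter?")

# Second turn: the user has answered, so generation can proceed.
second = answer_or_ask(
    "How is revenue this quarter?",
    {"revenue": "recognized", "this quarter": "fiscal quarter"},
)
```

The design choice worth noting is that the resolved definitions travel with the request, so the SQL generator never has to guess which meaning was intended.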
The second is messy schemas. Production schemas have hundreds or thousands of objects with inconsistent naming, JSON columns holding the real data, and a MARG_FORM column sitting next to an acctScope column. The relevant table is rarely the obvious one. Pulling the right columns out of a real schema is a search problem, and the model can’t solve it from the question alone.
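Treating column selection as search can be sketched like this. The catalog below, including the descriptions attached to MARG_FORM and acctScope, is invented for illustration, and plain word overlap stands in for whatever retrieval method a real system would use (typically embeddings or a tuned ranker).

```python
import re

# Hypothetical catalog: (table, column, human-written description).
CATALOG = [
    ("fin_pos", "MARG_FORM", "margin formula applied to the position"),
    ("fin_pos", "acctScope", "account scope: retail or institutional"),
    ("ord_hdr", "net_amt", "order revenue net of refunds in USD"),
    ("usr_dim", "is_internal", "true when the user is an internal employee"),
]


def tokens(text):
    """Lowercase word tokens, splitting snake_case and camelCase identifiers."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text).replace("_", " ")
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def rank_columns(question, top_k=2):
    """Score each column by word overlap between the question and its name plus description."""
    q = tokens(question)
    scored = []
    for table, column, description in CATALOG:
        overlap = len(q & tokens(f"{column} {description}"))
        scored.append((overlap, f"{table}.{column}"))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop columns that share no words with the question at all.
    return [name for overlap, name in scored[:top_k] if overlap > 0]


matches = rank_columns("What was revenue net of refunds last month?")
print(matches)
```

Notice that the match comes from the description, not the column name: nothing in net_amt says "revenue", which is why the question alone is not enough.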
The third is business context. “Active user” excludes internal employees. “MRR” handles refunds a specific way. These rules live in PRDs, runbooks, and old Slack threads, never in the schema itself. Without them the SQL comes out syntactically perfect and semantically wrong. Encoding those rules is what a semantic layer and explicit metric definitions are for, and retrieving them at query time is what RAG does.
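A minimal sketch of what encoding and retrieving those rules might look like follows. Everything here is hypothetical: the MetricDefinition shape, the table and column names, and the specific refund rule for MRR are invented, and the keyword lookup is a stand-in for real retrieval rather than a full RAG pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    rule: str        # business rule in plain language
    sql_filter: str  # predicate that encodes the rule


# Hypothetical definitions; the rules and column names are invented for illustration.
METRICS = {
    "active user": MetricDefinition(
        name="active user",
        rule="Excludes internal employees.",
        sql_filter="users.is_internal = FALSE",
    ),
    "mrr": MetricDefinition(
        name="mrr",
        rule="Refunded charges are excluded.",
        sql_filter="charges.refunded_at IS NULL",
    ),
}


def grounding_context(question):
    """Collect the definitions whose metric name appears in the question."""
    text = question.lower()
    return [metric for key, metric in METRICS.items() if key in text]


def build_prompt(question):
    """Prepend matched business rules so the generator sees them alongside the question."""
    lines = [
        f"- {m.name}: {m.rule} Apply: {m.sql_filter}"
        for m in grounding_context(question)
    ]
    if lines:
        header = "Business definitions:\n" + "\n".join(lines)
    else:
        header = "Business definitions: none found"
    return f"{header}\n\nQuestion: {question}"


prompt = build_prompt("How many active users did we have last week?")
print(prompt)
```

With the rule in the prompt, the generated SQL has a chance of filtering out internal employees; without it, the model has no way to know the filter exists.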
The pattern is consistent. The closer a benchmark gets to a real warehouse, the more the score depends on the system around the model rather than the model itself.
The head-to-head numbers on BIRD-Interact
This is the comparison that actually predicts production behavior. All figures are execution accuracy on BIRD-Interact, the benchmark built to look like your warehouse, scored with the official evaluation code.
| Approach | BIRD-Interact accuracy | What it does |
|---|---|---|
| Frontier model alone (Claude Opus 4.6, no grounding) | ~33% | Reads the question and the raw schema, generates SQL, guesses on ambiguity |
| Hex’s Magic AI | ~44% | Wraps a frontier model in a thin retrieval layer over schema metadata |
| Most BI/AI assistants (our measurement) | high 30s to mid 40s | Schema-aware prompting; a few points over the bare model |
| Datost | 75.2% | Grounds in schema + metric definitions + business context, and clarifies before generating |
It’s the same model in every row. The frontier model that scores ~33% on its own scores 75.2% inside Datost. A thin schema-retrieval wrapper buys a handful of points, which is how Hex’s Magic AI lands around 44%. The rest of the gap isn’t a smarter model. It’s grounding and clarification.
So, can you trust it?
Trust it the way the benchmarks suggest, which depends entirely on what you’re pointing it at.
- On clean, well-modeled schemas with unambiguous questions, text-to-SQL is reliable today. Spider-level accuracy is real.
- On real enterprise warehouses, a bare model or a thin AI assistant does worse than a coin flip on the hard questions. Accuracy in the ~33 to 44% range is not trustworthy for decisions.
- The accuracy you get is a property of the system, not the model. So ask any vendor for a third-party benchmark number. If they can’t point to one, the claim is unverifiable.
When you evaluate a tool, ask it the question the way you actually ask it: ambiguously, against your real schema, expecting your own business definitions to be respected. That’s the test that matters.
How Datost handles this
Datost treats text-to-SQL as a system problem. Before generating any SQL, it grounds every question in three sources: your warehouse schema, your metric definitions, and your business context docs (PRDs, runbooks, prior Slack threads). When a question is ambiguous, it asks back instead of guessing. Every answer ships with the SQL attached so an analyst can audit it and the next person can build on it.
That grounding accounts for most of the gap between ~33% and 75.2% on BIRD-Interact. It’s also why Datost can join the warehouse with CRM, billing, product analytics, and ticketing in a single query instead of one clean table at a time. See why Datost for how the proactive side works: it watches metrics continuously and posts the issue, root cause, and fix in Slack before anyone asks. If you’re comparing tools, the Datost vs Hex breakdown puts the BIRD-Interact numbers side by side.