Why AI Text-to-SQL Returns Wrong Numbers (And How to Catch It)
Text-to-SQL rarely throws an error when it's wrong. It returns a clean number built on the wrong join, the wrong filter, or a guessed metric definition. Here are the failure modes and how to catch them before the number ships.
AI text-to-SQL returns wrong numbers because it usually fails silently: the query runs, returns a clean number, and looks right, while quietly using the wrong join, the wrong filter, or a guessed definition of a term like “active user” or “revenue.” There is no error message, no red text, nothing to flag. The SQL is valid; the answer is not. That is the dangerous case, and it is far more common than the broken-query case. A query that errors out gets fixed. A query that returns $2,148,330 when the right answer is $1,904,512 gets pasted into a board deck. Catching this is not about a smarter model. It is about clarification, grounding, and making the SQL auditable before anyone trusts the number.
The four ways a number goes wrong
Most silent failures fall into four buckets. None of them throw an error.
The wrong join. Ask for revenue per customer and the model joins orders to customers on a key that fans out, or picks the wrong one of three plausible foreign keys. The query runs. The total is inflated because rows got duplicated by the join, or deflated because an inner join dropped customers with no orders. A many-to-one that should have been one-to-one is the single most common cause of a confidently wrong total. Faulty joins on multi-table queries are consistently among the top error types in text-to-SQL evaluations.
The missing filter. The model nails the primary instruction and drops a secondary condition. You ask for “active enterprise customers this quarter” and it filters on plan = 'enterprise' but forgets status = 'active', or it includes internal test accounts because nobody told it to exclude them. The number is in the right ballpark, which is exactly why nobody catches it. Synthesizing several constraints into one correct query is a known weak point: the model satisfies the loudest part of the question and quietly skips the rest.
The wrong metric definition. “Revenue” can mean booked, recognized, or net of refunds. “Churn” can be logo churn or revenue churn. “Active” can mean logged in, or performed a key action, in the last 1, 7, or 30 days. The schema does not encode which one your company means. A generic model picks the textbook definition and writes perfectly valid SQL for a metric your finance team does not use. See metric definition for why this layer is where so much goes wrong, and why teams push these recipes into a semantic layer so there is one answer.
The guessed ambiguity. Real questions are underspecified. “How’s revenue this quarter?” leaves out calendar versus fiscal, gross versus net, and which product lines count. A bare model resolves all of that silently, in a single direction, with no signal that it made three judgment calls on your behalf. The guess is plausible. That is the whole problem.
Why these failures are invisible
Three properties of text-to-SQL conspire to hide the wrong answer.
The query is syntactically valid, so the database is happy. A wrong join and a right join are both legal SQL. The engine has no opinion about whether you meant to deduplicate. Validity is not correctness, and the only check most tools run is validity.
The number is plausible. Wrong totals from a fan-out join or a dropped filter are usually in the same order of magnitude as the right one. Nobody sanity-checks a number that looks reasonable. A NULL or a -1 gets caught. A revenue figure that is 13% too high does not.
And the SQL is hidden. Many chat-style analytics tools show you the answer, not the query. You cannot audit a join you cannot see. This is the core tension in conversational analytics: the chat surface that makes a number easy to ask for is the same surface that makes the number hard to verify. Even when the SQL is shown, if the answer arrives without the metric definition it used or the assumptions it made, you would have to reverse-engineer the logic to check it, and nobody does that for a number that already looks fine.
There is a fourth, quieter property: non-determinism. Ask the same ambiguous question twice and you can get two different queries and two different numbers, because the model resolved the ambiguity differently each time. If a metric moves between Monday and Tuesday and the underlying data did not change, the generation did. For more on the underlying task, see text-to-SQL.
How to catch a wrong number before it ships
The fixes are not exotic. They are the difference between a demo and something you would put in front of a CFO.
Clarify instead of guessing. The single highest-leverage change is making the system ask when a question is ambiguous. “Do you mean booked or recognized revenue?” is a better response than a confident wrong number. This is exactly what the hardest public benchmark rewards. On BIRD-Interact, 600 deliberately ambiguous business questions, a system is scored on doing the right thing when the question is underspecified, including asking back, not just on emitting SQL. Systems that only emit SQL score far lower than systems that know when to clarify, which is most of the accuracy story.
Ground the model in your definitions, not the textbook’s. A model that retrieves your metric definitions before writing SQL stops guessing what “active” means. Retrieving the right context at query time is what RAG does for SQL, and it closes most of the gap. We walk through what to retrieve and how to structure it in grounding an LLM in business context. The short version: pull the relevant schema slice, the matching metric definitions, and the business-context docs, then inject them into the prompt so the model writes SQL against your conventions instead of its priors.
Validate the query against the live schema before running it. A cheap pre-execution check catches a whole class of silent failures: do the columns exist, do the joins resolve to real keys, does the grain make sense? Hallucinated join keys are among the most common preventable failures, and a schema check finds them before the number is computed. Column lineage sharpens this: it tells you which upstream columns a metric actually depends on, so you can verify the query touches the right ones.
Make the SQL the deliverable, not a hidden step. Every answer should arrive with the SQL attached, plus the metric definition it used and any assumptions it made. That turns auditing from a reverse-engineering exercise into a glance. An analyst can read the join, see the filter, confirm the definition, and either trust the number or fix it in seconds. A number you can audit is a number you can defend.
A practical way to catch the wrong-join case yourself: before trusting a total, check whether the join changed the row count.
-- If a join is supposed to be one-to-one, the row count
-- should not jump. A fan-out join silently inflates totals.
SELECT
(SELECT count(*) FROM orders WHERE order_date >= '2026-04-01') AS base_rows,
(SELECT count(*)
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.order_date >= '2026-04-01') AS joined_rows;
-- base_rows != joined_rows means the join is duplicating or
-- dropping rows, and any SUM() downstream is wrong.
The same discipline applies to a suspicious metric: re-run it with the filter spelled out explicitly and compare. If excluding internal accounts moves the number, the original query was missing that filter.
The number is a property of the system, not the model
The uncomfortable truth is that the same frontier model produces both the wrong number and the right one. What changes is whether it clarified, what it was grounded in, and whether the SQL was validated and surfaced for audit. A bare model on a real warehouse misses more of the hard questions than it gets right. A model wrapped in clarification, grounding, and verification is not. When you evaluate any text-to-SQL tool, ask it your real questions the way you actually ask them, ambiguously, against your schema, expecting your definitions, and then read the SQL it wrote. The tool comparison and the head-to-head pages break the contenders down on exactly this axis.
How Datost handles this
Datost is built to refuse the silent wrong answer. Before generating any SQL, it grounds every question in three sources: your warehouse schema, your metric definitions, and your business-context docs (PRDs, runbooks, prior Slack threads). When a question is genuinely ambiguous, it asks back instead of guessing. Every answer ships with the SQL attached and the definition it used, so an analyst can audit the join, the filter, and the metric in one glance. That grounding-plus-clarification is most of why the same frontier model scores around 33% alone on BIRD-Interact and 75.2% inside Datost. It is also why Datost can join your warehouse with CRM, billing, product analytics, and ticketing in a single query and still show its work. See what Datost does and why Datost for how the proactive side catches wrong numbers before anyone asks the question.