How to Give an LLM the Business Context to Write Correct SQL

The mechanism that makes text-to-SQL accurate is grounding the model in three things before it writes a query: schema, metric definitions, and business context. Here's what to retrieve and how to structure it.

To give an LLM the business context to write correct SQL, retrieve and inject three things into the prompt before the model generates a query: the relevant subset of your warehouse schema (tables, columns, types, and join keys), your metric definitions (the exact recipes for revenue, churn, active user, and the rest of your terms), and your business context (PRDs, runbooks, and prior analyses that explain what the data means). The model already knows SQL syntax. What it doesn’t know is your warehouse and your conventions, and that gap is where the wrong answers come from. On BIRD-Interact, the hardest public text-to-SQL benchmark, a frontier model scores around 33% with no grounding. The same model inside a system built on this grounding pipeline scores 75.2%. The ~42-point gap is mostly the context.

Retrieval pulls only the relevant schema, metric definitions, and business context into the prompt before the model writes a query. That grounding, not a bigger model, is most of the jump from 33% to 75.2%.

Why “dump the whole schema in the prompt” fails

The naive approach is to paste your entire schema into the system prompt and ask the model to figure it out. It works for a toy database with five tables. On a real warehouse it falls apart, for a few reasons.

Context windows are finite and noisy. Real schemas have hundreds of tables and thousands of columns. Even when the whole thing fits, the irrelevant 95% dilutes the model’s attention. The schema-linking research is consistent here: dumping in the full schema adds noise that hurts the model’s ability to focus on the columns it actually needs, even when everything fits inside a large context window.

Naming is a trap. Warehouses are full of users, users_v2, dim_users, and users_staging. The model has no way to know that revenue lives in fct_invoices.amount_cents and not orders.total. It picks the plausible-looking one and is confidently wrong.

And the schema doesn’t carry meaning. A column named status tells the model nothing about which value means “active.” Types and names are syntax. The business meaning lives somewhere else.

The fix is retrieval, not volume. Pull the small slice of schema that’s actually relevant to the question, and pair it with the semantics the schema can’t express. That’s RAG applied to text-to-SQL.

The three context sources, ranked by impact

1. Schema (the structure)

Retrieve only the tables and columns relevant to the question. The research literature calls this step schema linking. Embed your table and column descriptions, then run similarity search against the question to pull the relevant subset. For each candidate table, include:

Column names and types.
A one-line description per column (“mrr_cents: monthly recurring revenue in cents, excludes one-time charges”).
Foreign keys and the canonical join paths between tables.
A few representative values for low-cardinality columns (status ∈ active, churned, trialing).

That last point matters more than people expect. Telling the model that status = 'active' is a real value stops it from inventing status = 'ACTIVE' or is_active = true.

2. Metric definitions (the meaning)

Schema tells the model where the data is. A metric definition tells it what the data means. This is the highest-leverage context you can add, because “revenue,” “active user,” and “churn” are precisely the terms a generic model gets wrong.

Structure each definition as a small, retrievable document:

metric: active_customer
description: A customer who logged a session in the last 7 days, excluding internal accounts.
sql: |
  SELECT DISTINCT s.customer_id
  FROM fct_sessions s
  JOIN dim_customers c ON c.id = s.customer_id
  WHERE s.event_at >= NOW() - INTERVAL '7 days'
    AND c.is_internal = FALSE
grain: per-customer
owner: growth-team
last_reviewed: 2026-05-01

Ask “how many active customers churned last month,” and the system retrieves this definition. Now the model writes SQL that respects the 7-day window and the internal-account exclusion, instead of joining naively against users and returning a number that’s directionally close but quietly wrong. If you keep definitions in a semantic layer, point at that as the source of truth. If they live in Notion or Markdown, retrieve from there.

3. Business context (the institutional knowledge)

The third layer is everything that isn’t schema or a formal metric but still changes the right answer. A PRD that explains a feature flag. A runbook noting that the EU region migrated warehouses in March. A Slack thread where someone established, months ago, that Q4 numbers exclude a specific enterprise pilot. This context lives in documents, not databases, and it’s what separates an answer that’s technically valid from one that’s actually correct. Retrieve it the same way you retrieve everything else: embed the docs, search against the question, inject the top hits.

How to wire it together

The pipeline is the standard five-step RAG loop, specialized for SQL:

Parse the question for entities and intent (“active customers,” “churned,” “last month”).
Retrieve the relevant schema subset, the matching metric definitions, and the top business-context docs.
Assemble the prompt from the retrieved context, the question, and the SQL dialect.
Generate the SQL, and have the model cite which definition and tables it used.
Validate the query against the real schema before running it (do the columns exist, do the joins resolve?), then surface the result for audit.

Two additions move this from a demo to something you’d run in production.

Clarify instead of guessing. If two metric definitions conflict (“MRR per finance” vs. “MRR per growth”), or the time grain is ambiguous, the system should ask rather than silently pick one. BIRD-Interact rewards exactly this behavior, which is why models that just emit SQL score so much lower than systems that know when to ask back.

Validate against live structure. Hallucinated join keys are the most common silent failure, and a cheap schema check before execution catches them. (Column lineage helps here too: it tells you which upstream columns a metric actually depends on, which sharpens both retrieval and validation.)

How Datost handles this

This grounding pipeline is, almost exactly, how Datost works. For every question, Datost retrieves across three sources (your warehouse schema, your metric definitions, and your business-context docs: PRDs, runbooks, prior Slack threads you’ve connected), assembles them into the prompt, and asks for clarification when the question is genuinely ambiguous. Every answer ships with the SQL and the cited definition attached, so an analyst can audit it and the next person can build on it.

That context layer is most of the accuracy gap on BIRD-Interact: the same frontier model scores ~33% alone and 75.2% inside Datost. It’s also why Datost can join your warehouse with CRM, billing, and product analytics in one query, then post the issue, the root cause, and the fix to Slack before anyone asks. If you want to see the grounding work on your own schema, start here.