Blog · June 5, 2026 · Datost

How to Calculate Cohort Retention in SQL

A practical, runnable SQL walkthrough for cohort retention: bucket users into signup-month cohorts, count who returns in each month, build the retention matrix, and read the curve.

To calculate cohort retention in SQL, you assign each user to a cohort (usually their signup month), count how many users from each cohort were active in each later period, then divide those counts by the cohort’s original size to get a retention percentage. The whole thing is a GROUP BY on two derived columns, the cohort month and the number of months elapsed since signup, wrapped in a window or join that recovers the cohort’s starting headcount. Below is a runnable query you can paste into Postgres, Snowflake, or BigQuery, plus notes on the one or two functions that differ between them.

The two tables you need

You need an events table (one row per user action, with a timestamp) and a way to find each user’s first action. Most warehouses have something like this:

-- events: one row per user action
-- user_id      | event_at
-- 'u_001'      | 2026-01-04 09:12:00
-- 'u_001'      | 2026-02-18 14:03:00
-- 'u_002'      | 2026-01-22 08:40:00

“Active” can mean a login, a session, an order, or whatever event your team agrees counts. That definition is the single most consequential decision in the whole query, and Step 1 covers how to pick it. If signup lives in a separate users table, swap the first-event logic for that table’s created_at.

Step 1: define the cohort and the active event

A cohort is the set of users who share a starting point. The usual choice is signup month. Derive each user’s cohort by truncating their first-ever event to the first of the month:

WITH first_seen AS (
  SELECT
    user_id,
    DATE_TRUNC('month', MIN(event_at)) AS cohort_month
  FROM events
  GROUP BY user_id
)

DATE_TRUNC('month', ts) returns the first day of that month, so every user who started in January 2026 gets the same cohort_month of 2026-01-01. That grouping is what keeps the January batch’s behavior separate from February’s instead of mushing them together. (For the conceptual background, see cohort analysis.)

Dialect note: Postgres and Snowflake use DATE_TRUNC('month', event_at). BigQuery reverses the argument order and drops the quotes: DATE_TRUNC(event_at, MONTH).

Step 2: compute the month offset for every active event

For each event, we need to know how many months after the user’s cohort start it happened. Month 0 (M0) is the signup month itself; M1 is the next month, and so on.

activity AS (
  SELECT
    f.cohort_month,
    e.user_id,
    -- whole months between cohort start and this event
    (EXTRACT(YEAR FROM e.event_at) - EXTRACT(YEAR FROM f.cohort_month)) * 12
      + (EXTRACT(MONTH FROM e.event_at) - EXTRACT(MONTH FROM f.cohort_month))
      AS month_offset
  FROM events e
  JOIN first_seen f USING (user_id)
)

I wrote the month difference out with EXTRACT on purpose: it works the same across all three warehouses and there’s no ambiguity. The shorter dialect-specific versions:

  • Postgres: there is no DATEDIFF. The EXTRACT expression above is the clean portable choice.
  • Snowflake: DATEDIFF('month', f.cohort_month, e.event_at)
  • BigQuery: DATE_DIFF(DATE(e.event_at), DATE(f.cohort_month), MONTH)

One subtlety: a month-grained diff counts boundary crossings, not 30-day blocks. The difference between Jan 28 and Feb 3 is 1, because the calendar month changed. That’s exactly what you want for monthly cohorts, since you’re bucketing by month, not by elapsed days.

Step 3: count distinct active users per cohort per offset

Now collapse the events down to one count per (cohort, month offset). Use COUNT(DISTINCT user_id) so a user who fires ten events in a month still counts once.

cohort_size AS (
  SELECT cohort_month, COUNT(DISTINCT user_id) AS total_users
  FROM first_seen
  GROUP BY cohort_month
),

retention AS (
  SELECT
    cohort_month,
    month_offset,
    COUNT(DISTINCT user_id) AS active_users
  FROM activity
  GROUP BY cohort_month, month_offset
)

cohort_size is the denominator, the headcount of each cohort at M0. retention is the numerator for every cell in the table.

Step 4: build the retention matrix

Join the counts to the cohort sizes and compute the percentage. The output is one row per cohort per month offset, the long form of the retention table.

SELECT
  r.cohort_month,
  c.total_users,
  r.month_offset,
  r.active_users,
  ROUND(100.0 * r.active_users / c.total_users, 1) AS retention_pct
FROM retention r
JOIN cohort_size c USING (cohort_month)
ORDER BY r.cohort_month, r.month_offset;

Note the 100.0. In Postgres, / between two integers does integer division and will silently floor your percentages to 0, so force a numeric type by multiplying with a decimal literal first. The integer-division trap bites everyone once. (BigQuery and Snowflake already return a decimal when you divide integers, but multiplying by 100.0 keeps the query portable, so write it once and it works everywhere.)

This produces rows like the following. The numbers below are illustrative sample output, not benchmark figures:

cohort_month | total_users | month_offset | active_users | retention_pct
2026-01-01   | 1200        | 0            | 1200         | 100.0
2026-01-01   | 1200        | 1            | 540          | 45.0
2026-01-01   | 1200        | 2            | 432          | 36.0
2026-02-01   | 1500        | 0            | 1500         | 100.0
2026-02-01   | 1500        | 1            | 690          | 46.0

Pivoting into the classic triangle

Analysts usually want cohorts as rows and M0, M1, M2… as columns. Pivot the long form with conditional aggregation, which runs on every warehouse:

SELECT
  cohort_month,
  MAX(CASE WHEN month_offset = 0 THEN retention_pct END) AS m0,
  MAX(CASE WHEN month_offset = 1 THEN retention_pct END) AS m1,
  MAX(CASE WHEN month_offset = 2 THEN retention_pct END) AS m2,
  MAX(CASE WHEN month_offset = 3 THEN retention_pct END) AS m3
FROM (/* the Step 4 query */) matrix
GROUP BY cohort_month
ORDER BY cohort_month;

M0 is always 100.0 by construction. Newer cohorts have NULL in the far-right columns because not enough calendar time has passed. That staircase of NULLs is the diagonal edge of the cohort triangle, not missing data.

A cohort retention triangle: rows are signup-month cohorts from January to May 2026, columns are months since signup M0 through M5. M0 is always 100 percent, retention decays across each row, and newer cohorts leave a staircase of empty cells where not enough time has passed. Cohort retention by signup month (%) M0M1M2M3M4M5 Jan 2026 100 45 36 32 30 29 Feb 2026 100 46 38 34 31 Mar 2026 100 44 37 33 Apr 2026 100 47 39 May 2026 100 48
Illustrative data. M0 is 100% by construction; each row decays then flattens. Reading down a column compares the same age across cohorts. The empty staircase is just time that hasn't passed yet.

How to read the curve

Each row of the matrix is a retention curve for one cohort. Read it three ways:

  1. Across a row (one cohort over time). Does the percentage fall and then flatten, or keep sliding toward zero? A row that levels off at, say, 40% means you found a sticky core. A row that decays every month is a leak.
  2. Down a column (same age, different cohorts). Compare M1 for January, February, and March. If newer cohorts show lower M1 retention than older ones, the product is getting worse at retaining users, even when climbing total signups hide it in the aggregate.
  3. The shape of the floor. Two products can share an identical M1 and end up wildly apart at M12. The slope after the initial drop is what forecasts long-term retention, which is why the curve beats any single number.

A common gotcha: if your M0 is ever below 100%, your active-event definition is narrower than your cohort definition (say, the cohort is signups but the event is “placed an order”). That’s a legitimate activation question rather than a bug. Just know which one you’re measuring. See activation funnel for the distinction.

Weekly cohorts and zero-fill

For faster-moving products, swap 'month' for 'week' in DATE_TRUNC and change the offset math to weeks. If you want a row for every period even when zero users were active (so a charting tool draws an unbroken line), generate a dense spine of offsets with generate_series (Postgres) or GENERATE_ARRAY (BigQuery), CROSS JOIN it to each cohort, then LEFT JOIN the counts and COALESCE(active_users, 0).

How Datost handles this

The query above is the easy part. The hard part is agreeing on what “active” means and holding that definition steady across every cohort table your team builds. Datost grounds each question in your warehouse schema, your metric definitions, and your business context before it writes SQL. So when someone asks “show me retention by signup cohort for 2026,” the active-user definition matches the one already on your dashboard, and the SQL comes attached for review. It also watches those cohorts continuously and flags in Slack when a newer cohort starts decaying faster than the last, before anyone thinks to run the query.