# Safe Column Migrations (NOT NULL & type changes)

**RULE — ALWAYS APPLY.** Never run a bare `ALTER COLUMN ... SET NOT NULL` or a bare `ALTER COLUMN ... TYPE ...` (e.g. `int → bigint`) in a single migration. Both take an `ACCESS EXCLUSIVE` lock and either full-scan (`SET NOT NULL`) or fully rewrite (`TYPE`) the table while blocking **all** reads and writes — an outage on any large/hot table. Use the multi-step rollouts below. Always prefix strong-lock statements with `SET lock_timeout = '3s'` so the migration fails fast instead of queueing behind a long transaction and blocking everyone.

## One migration per release — never chain dependent migrations in the same deploy

A migration that depends on a previous one (e.g. Playbook A's `SET NOT NULL` depends on the *validated* CHECK created by the prior migration) **must ship in a separate, later release**. Do not put both migrations in the same PR/deploy.

The next migration may only go out once the previous one has **actually run in every environment** (dev → stg → prd, plus testslots) and you have **verified it finished OK** — not merely merged. Leave a deploy cycle of separation between them (at least the normal promotion time across environments); don't rush the second migration out minutes after the first.

Why: the steps are split precisely so each strong-lock operation is isolated and confirmable. If both ran back-to-back in the same batch, a slow `VALIDATE CONSTRAINT` on a big table would be followed immediately by `SET NOT NULL`, with no chance to confirm the validation actually completed — and if it didn't, `SET NOT NULL` falls back to a blocking full scan, the exact outage we're avoiding.

**Verification gate before the dependent migration.** For the `NOT NULL` playbook, confirm in **each** environment that the CHECK exists *and* is validated:

```sql
SELECT conname, convalidated
FROM pg_constraint
WHERE conname = 'Groups_membersCanLeave_not_null';
-- Expect exactly one row with convalidated = t.
-- If convalidated = f (or no row), Migration 1 has not finished — do NOT run Migration 2 yet.
```

## First: figure out which step you're on

These rollouts span **multiple PRs** (migrations + app deploys). A prior PR may already have done some steps, so **never assume you start at step 1**. Detect the current state from the artifacts each step leaves behind, then jump to the next step.

Inspect these sources for the target `"{Table}"."{column}"` (substitute real names):

- **Existing migrations** — what already shipped:
  ```bash
  grep -rln '{Table}' humand-packages/migrations-runner/src/main-migrations/ | xargs grep -l '{column}'
  ```
  Look for which statements appear: `ADD CONSTRAINT ... CHECK (... IS NOT NULL) NOT VALID`, `VALIDATE CONSTRAINT`, `SET NOT NULL`, `addColumn('{Table}', '{column}New'`, `DROP NOT NULL`, `renameColumn`.
- **ORM model** — what the app currently expects. Find the Sequelize model for the table (e.g. under a module's `ormModels/` or `src/api/.../models`) and read the column's `allowNull` and `type`.
- **App code** — whether the guard / dual-write / cutover deploys happened: `grep -rn '{column}' humand-packages/monolith/src` and check whether code writes it, dual-writes `{column}New`, and which one it reads.
- **DB state (authoritative, if you have a connection)** — only the DB can confirm NULL counts and the live constraint/column state:
  ```sql
  SELECT count(*) FROM "{Table}" WHERE "{column}" IS NULL;                 -- NULLs remaining
  SELECT conname, convalidated FROM pg_constraint WHERE conname = '{Table}_{column}_not_null';  -- CHECK present + validated (convalidated = t)?
  SELECT is_nullable, data_type FROM information_schema.columns
   WHERE table_name = '{Table}' AND column_name = '{column}';             -- column state
  ```

### Decision table — Playbook A (`NOT NULL`)

| Observed state | You are at | Do next |
|---|---|---|
| Column is `NOT NULL` in DB | ✅ done | nothing |
| Validated `{Table}_{column}_not_null` CHECK exists, column still nullable | after step 3 | **step 4** (SET NOT NULL + DROP CHECK) |
| No CHECK, `SELECT count(*) ... IS NULL` = 0, app no longer writes NULL | after step 2 | **step 3** (add NOT VALID CHECK + VALIDATE) |
| App guard shipped (code never writes NULL) but NULL rows still exist | after step 1 | **step 2** (backfill, then verify 0) |
| Column nullable and app still writes/accepts NULL | nothing done | **step 1** (app guard) |

### Decision table — Playbook B (type widening)

| Observed state | You are at | Do next |
|---|---|---|
| Old column renamed to `{column}Old` (or dropped) | ✅ done | nothing (optional cleanup migration) |
| App reads/writes only `{column}New`, old column still named `{column}` | after step 5 | **step 7** (rename old column) |
| Old column is now nullable, app still reads/writes old `{column}` | after step 5 | **step 6** (app cutover deploy) |
| `{column}New` is `NOT NULL`, old `{column}` still `NOT NULL` | after step 4 | **step 5** (DROP NOT NULL on old column) |
| `{column}New` fully backfilled (0 NULLs) but still nullable | after step 3 | **step 4** (finalize `{column}New` as NOT NULL — run Playbook A on it) |
| App dual-writes both columns, `{column}New` still has NULLs | after step 2 | **step 3** (backfill `{column}New`, then verify 0) |
| `{column}New` column exists but app does not dual-write it | after step 1 | **step 2** (app dual-write deploy) |
| Only the old `{column}` exists | nothing done | **step 1** (add new nullable column) |

> If two PRs are in flight (e.g. a migration merged but its paired app deploy hasn't gone out yet), trust the **app code on the current branch** plus the **DB state** over the migration files alone — the deploy ordering is what determines safety.

## Playbook A — make a column `NOT NULL`

Ordered rollout (do not skip or reorder):

1. **App guard (deploy first):** change application code so it never writes `NULL` to the column. New rows are already non-null before any migration runs.
2. **Backfill:** update existing `NULL` rows to a valid value (in batches for large tables). Verify: `SELECT count(*) FROM "Table" WHERE "column" IS NULL` returns `0`.
3. **Migration 1 — add a validated CHECK** (cheap lock, no blocking scan):

```javascript
async up(queryInterface) {
    await queryInterface.sequelize.query(`SET lock_timeout = '3s'`);
    // NOT VALID = instant, only checks new rows, no full scan under strong lock
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups"
        ADD CONSTRAINT "Groups_membersCanLeave_not_null"
        CHECK ("membersCanLeave" IS NOT NULL) NOT VALID
    `);
    // VALIDATE scans the table but only under SHARE UPDATE EXCLUSIVE (reads/writes keep working)
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups" VALIDATE CONSTRAINT "Groups_membersCanLeave_not_null"
    `);
},
async down(queryInterface) {
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups" DROP CONSTRAINT IF EXISTS "Groups_membersCanLeave_not_null"
    `);
},
```

> **Retry note:** the `SET lock_timeout = '3s'` set at the top of the migration also applies to `VALIDATE CONSTRAINT` (it runs in the same session). `VALIDATE CONSTRAINT` only needs a weak `SHARE UPDATE EXCLUSIVE` lock, but if a concurrent `ACCESS EXCLUSIVE`/`EXCLUSIVE` lock is held it fails after 3 s with a lock-timeout error instead of waiting. That is the intended fail-fast behaviour — the migration is **safe to re-run** during a low-traffic window.

4. **Migration 2 — promote to `NOT NULL`** (separate, later release — see "One migration per release" above; only after Migration 1 has been deployed and the verification gate shows `convalidated = t` in every environment). Because a validated `CHECK (col IS NOT NULL)` already exists, Postgres ≥12 skips the full scan, so `SET NOT NULL` is near-instant:

```javascript
async up(queryInterface) {
    await queryInterface.sequelize.query(`SET lock_timeout = '3s'`);
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups" ALTER COLUMN "membersCanLeave" SET NOT NULL
    `);
    // Drop the now-redundant CHECK (the column-level NOT NULL replaces it)
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups" DROP CONSTRAINT "Groups_membersCanLeave_not_null"
    `);
},
async down(queryInterface) {
    await queryInterface.sequelize.query(`SET lock_timeout = '3s'`);
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups" ALTER COLUMN "membersCanLeave" DROP NOT NULL
    `);
    // Restore the *validated* CHECK that Migration 1 left behind — not just NOT VALID.
    // If we re-added it as NOT VALID, re-running Migration 2 would no longer skip the
    // full scan (Postgres only trusts a validated constraint), defeating the playbook.
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups"
        ADD CONSTRAINT "Groups_membersCanLeave_not_null"
        CHECK ("membersCanLeave" IS NOT NULL) NOT VALID
    `);
    await queryInterface.sequelize.query(`
        ALTER TABLE "Groups" VALIDATE CONSTRAINT "Groups_membersCanLeave_not_null"
    `);
},
```

Constraint naming convention: `"{Table}_{column}_not_null"`.

## Playbook B — widen a column type (e.g. `int → bigint`)

Never rewrite in place. Add a sibling column and migrate onto it. Ordered rollout:

1. **Migration 1 — add the new nullable column** (cheap; nullable column add is metadata-only):

```javascript
async up(queryInterface, Sequelize) {
    await queryInterface.sequelize.query(`SET lock_timeout = '3s'`);
    await queryInterface.addColumn('Sessions', 'viewCountNew', {
        type: Sequelize.BIGINT,
        allowNull: true,
    });
},
async down(queryInterface) {
    await queryInterface.removeColumn('Sessions', 'viewCountNew');
},
```

2. **App dual-write (deploy):** application writes both `viewCount` and `viewCountNew` on every insert/update. Keep reading from the old column for now.
3. **Backfill** the new column from the old one, in batches for large tables, then verify:

```javascript
async up(queryInterface) {
    let affected;
    do {
        const [, meta] = await queryInterface.sequelize.query(`
            WITH batch AS (
                SELECT "id" FROM "Sessions"
                WHERE "viewCountNew" IS NULL AND "viewCount" IS NOT NULL
                LIMIT 10000
            )
            UPDATE "Sessions" s
            SET "viewCountNew" = s."viewCount"::bigint
            FROM batch
            WHERE s."id" = batch."id"
        `);
        affected = meta.rowCount;
    } while (affected > 0);
},
async down() {
    // no-op: backfilled values are harmless; rolling back Migration 1 will drop the column entirely
},
```

Final verification (run manually or as a guard): `SELECT count(*) FROM "Sessions" WHERE "viewCountNew" IS NULL` must be `0` (assuming `viewCount` itself is `NOT NULL`).

4. **Migration — finalize the new column as `NOT NULL`** using Playbook A's CHECK + VALIDATE + SET NOT NULL + DROP CHECK on `viewCountNew` (same rule: the SET NOT NULL goes in a separate, later release, gated on the `convalidated = t` check from the prior migration).
5. **Migration — drop `NOT NULL` on the old column** (only if `viewCount` was `NOT NULL`). This **must** run **before** the cutover deploy: once the app stops writing `viewCount`, any new `INSERT` that omits it would violate the old `NOT NULL` and fail. `DROP NOT NULL` is metadata-only (instant, no scan):

```javascript
async up(queryInterface) {
    await queryInterface.sequelize.query(`SET lock_timeout = '3s'`);
    await queryInterface.sequelize.query(`
        ALTER TABLE "Sessions" ALTER COLUMN "viewCount" DROP NOT NULL
    `);
},
async down() {
    // No-op on purpose: re-adding NOT NULL would require the full CHECK + VALIDATE
    // + SET NOT NULL dance (a bare SET NOT NULL here would full-scan under ACCESS
    // EXCLUSIVE). The old column is being retired anyway, so leave it nullable.
},
```

6. **App cutover (deploy):** application reads **and** writes only `viewCountNew`. The old `viewCount` is now unused and nullable, so inserts that omit it succeed.
7. **Migration — rename the old column out of the way** (do not drop immediately; keep it for one release as a safety net):

```javascript
async up(queryInterface) {
    await queryInterface.renameColumn('Sessions', 'viewCount', 'viewCountOld');
},
async down(queryInterface) {
    await queryInterface.renameColumn('Sessions', 'viewCountOld', 'viewCount');
},
```

A later cleanup migration can drop `viewCountOld` once the rename has shipped safely.

## Quick checklist before writing either migration type

- [ ] Strong-lock statements are preceded by `SET lock_timeout = '3s'`.
- [ ] `SET NOT NULL` is only ever run **after** a validated `CHECK (col IS NOT NULL)` exists (separate prior migration).
- [ ] Type changes use add-column + dual-write + backfill + rename — never `ALTER COLUMN ... TYPE` in place on a populated table.
- [ ] On a type change, if the old column was `NOT NULL`, its `NOT NULL` is dropped **before** the app stops writing it (otherwise inserts that omit it fail).
- [ ] App-side deploys (guard / dual-write / cutover) are sequenced relative to migrations as described.
- [ ] Both `up` and `down` are implemented.