# Safe Column Migrations (NOT NULL & type changes) **RULE — ALWAYS APPLY.** Never run a bare `ALTER COLUMN ... SET NOT NULL` or a bare `ALTER COLUMN ... TYPE ...` (e.g. `int → bigint`) in a single migration. Both take an `ACCESS EXCLUSIVE` lock and either full-scan (`SET NOT NULL`) or fully rewrite (`TYPE`) the table while blocking **all** reads and writes — an outage on any large/hot table. Use the multi-step rollouts below. Always prefix strong-lock statements with `SET lock_timeout = '3s'` so the migration fails fast instead of queueing behind a long transaction and blocking everyone. ## One migration per release — never chain dependent migrations in the same deploy A migration that depends on a previous one (e.g. Playbook A's `SET NOT NULL` depends on the *validated* CHECK created by the prior migration) **must ship in a separate, later release**. Do not put both migrations in the same PR/deploy. The next migration may only go out once the previous one has **actually run in every environment** (dev → stg → prd, plus testslots) and you have **verified it finished OK** — not merely merged. Leave a deploy cycle of separation between them (at least the normal promotion time across environments); don't rush the second migration out minutes after the first. Why: the steps are split precisely so each strong-lock operation is isolated and confirmable. If both ran back-to-back in the same batch, a slow `VALIDATE CONSTRAINT` on a big table would be followed immediately by `SET NOT NULL`, with no chance to confirm the validation actually completed — and if it didn't, `SET NOT NULL` falls back to a blocking full scan, the exact outage we're avoiding. **Verification gate before the dependent migration.** For the `NOT NULL` playbook, confirm in **each** environment that the CHECK exists *and* is validated: ```sql SELECT conname, convalidated FROM pg_constraint WHERE conname = 'Groups_membersCanLeave_not_null'; -- Expect exactly one row with convalidated = t. -- If convalidated = f (or no row), Migration 1 has not finished — do NOT run Migration 2 yet. ``` ## First: figure out which step you're on These rollouts span **multiple PRs** (migrations + app deploys). A prior PR may already have done some steps, so **never assume you start at step 1**. Detect the current state from the artifacts each step leaves behind, then jump to the next step. Inspect these sources for the target `"{Table}"."{column}"` (substitute real names): - **Existing migrations** — what already shipped: ```bash grep -rln '{Table}' humand-packages/migrations-runner/src/main-migrations/ | xargs grep -l '{column}' ``` Look for which statements appear: `ADD CONSTRAINT ... CHECK (... IS NOT NULL) NOT VALID`, `VALIDATE CONSTRAINT`, `SET NOT NULL`, `addColumn('{Table}', '{column}New'`, `DROP NOT NULL`, `renameColumn`. - **ORM model** — what the app currently expects. Find the Sequelize model for the table (e.g. under a module's `ormModels/` or `src/api/.../models`) and read the column's `allowNull` and `type`. - **App code** — whether the guard / dual-write / cutover deploys happened: `grep -rn '{column}' humand-packages/monolith/src` and check whether code writes it, dual-writes `{column}New`, and which one it reads. - **DB state (authoritative, if you have a connection)** — only the DB can confirm NULL counts and the live constraint/column state: ```sql SELECT count(*) FROM "{Table}" WHERE "{column}" IS NULL; -- NULLs remaining SELECT conname, convalidated FROM pg_constraint WHERE conname = '{Table}_{column}_not_null'; -- CHECK present + validated (convalidated = t)? SELECT is_nullable, data_type FROM information_schema.columns WHERE table_name = '{Table}' AND column_name = '{column}'; -- column state ``` ### Decision table — Playbook A (`NOT NULL`) | Observed state | You are at | Do next | |---|---|---| | Column is `NOT NULL` in DB | ✅ done | nothing | | Validated `{Table}_{column}_not_null` CHECK exists, column still nullable | after step 3 | **step 4** (SET NOT NULL + DROP CHECK) | | No CHECK, `SELECT count(*) ... IS NULL` = 0, app no longer writes NULL | after step 2 | **step 3** (add NOT VALID CHECK + VALIDATE) | | App guard shipped (code never writes NULL) but NULL rows still exist | after step 1 | **step 2** (backfill, then verify 0) | | Column nullable and app still writes/accepts NULL | nothing done | **step 1** (app guard) | ### Decision table — Playbook B (type widening) | Observed state | You are at | Do next | |---|---|---| | Old column renamed to `{column}Old` (or dropped) | ✅ done | nothing (optional cleanup migration) | | App reads/writes only `{column}New`, old column still named `{column}` | after step 5 | **step 7** (rename old column) | | Old column is now nullable, app still reads/writes old `{column}` | after step 5 | **step 6** (app cutover deploy) | | `{column}New` is `NOT NULL`, old `{column}` still `NOT NULL` | after step 4 | **step 5** (DROP NOT NULL on old column) | | `{column}New` fully backfilled (0 NULLs) but still nullable | after step 3 | **step 4** (finalize `{column}New` as NOT NULL — run Playbook A on it) | | App dual-writes both columns, `{column}New` still has NULLs | after step 2 | **step 3** (backfill `{column}New`, then verify 0) | | `{column}New` column exists but app does not dual-write it | after step 1 | **step 2** (app dual-write deploy) | | Only the old `{column}` exists | nothing done | **step 1** (add new nullable column) | > If two PRs are in flight (e.g. a migration merged but its paired app deploy hasn't gone out yet), trust the **app code on the current branch** plus the **DB state** over the migration files alone — the deploy ordering is what determines safety. ## Playbook A — make a column `NOT NULL` Ordered rollout (do not skip or reorder): 1. **App guard (deploy first):** change application code so it never writes `NULL` to the column. New rows are already non-null before any migration runs. 2. **Backfill:** update existing `NULL` rows to a valid value (in batches for large tables). Verify: `SELECT count(*) FROM "Table" WHERE "column" IS NULL` returns `0`. 3. **Migration 1 — add a validated CHECK** (cheap lock, no blocking scan): ```javascript async up(queryInterface) { await queryInterface.sequelize.query(`SET lock_timeout = '3s'`); // NOT VALID = instant, only checks new rows, no full scan under strong lock await queryInterface.sequelize.query(` ALTER TABLE "Groups" ADD CONSTRAINT "Groups_membersCanLeave_not_null" CHECK ("membersCanLeave" IS NOT NULL) NOT VALID `); // VALIDATE scans the table but only under SHARE UPDATE EXCLUSIVE (reads/writes keep working) await queryInterface.sequelize.query(` ALTER TABLE "Groups" VALIDATE CONSTRAINT "Groups_membersCanLeave_not_null" `); }, async down(queryInterface) { await queryInterface.sequelize.query(` ALTER TABLE "Groups" DROP CONSTRAINT IF EXISTS "Groups_membersCanLeave_not_null" `); }, ``` > **Retry note:** the `SET lock_timeout = '3s'` set at the top of the migration also applies to `VALIDATE CONSTRAINT` (it runs in the same session). `VALIDATE CONSTRAINT` only needs a weak `SHARE UPDATE EXCLUSIVE` lock, but if a concurrent `ACCESS EXCLUSIVE`/`EXCLUSIVE` lock is held it fails after 3 s with a lock-timeout error instead of waiting. That is the intended fail-fast behaviour — the migration is **safe to re-run** during a low-traffic window. 4. **Migration 2 — promote to `NOT NULL`** (separate, later release — see "One migration per release" above; only after Migration 1 has been deployed and the verification gate shows `convalidated = t` in every environment). Because a validated `CHECK (col IS NOT NULL)` already exists, Postgres ≥12 skips the full scan, so `SET NOT NULL` is near-instant: ```javascript async up(queryInterface) { await queryInterface.sequelize.query(`SET lock_timeout = '3s'`); await queryInterface.sequelize.query(` ALTER TABLE "Groups" ALTER COLUMN "membersCanLeave" SET NOT NULL `); // Drop the now-redundant CHECK (the column-level NOT NULL replaces it) await queryInterface.sequelize.query(` ALTER TABLE "Groups" DROP CONSTRAINT "Groups_membersCanLeave_not_null" `); }, async down(queryInterface) { await queryInterface.sequelize.query(`SET lock_timeout = '3s'`); await queryInterface.sequelize.query(` ALTER TABLE "Groups" ALTER COLUMN "membersCanLeave" DROP NOT NULL `); // Restore the *validated* CHECK that Migration 1 left behind — not just NOT VALID. // If we re-added it as NOT VALID, re-running Migration 2 would no longer skip the // full scan (Postgres only trusts a validated constraint), defeating the playbook. await queryInterface.sequelize.query(` ALTER TABLE "Groups" ADD CONSTRAINT "Groups_membersCanLeave_not_null" CHECK ("membersCanLeave" IS NOT NULL) NOT VALID `); await queryInterface.sequelize.query(` ALTER TABLE "Groups" VALIDATE CONSTRAINT "Groups_membersCanLeave_not_null" `); }, ``` Constraint naming convention: `"{Table}_{column}_not_null"`. ## Playbook B — widen a column type (e.g. `int → bigint`) Never rewrite in place. Add a sibling column and migrate onto it. Ordered rollout: 1. **Migration 1 — add the new nullable column** (cheap; nullable column add is metadata-only): ```javascript async up(queryInterface, Sequelize) { await queryInterface.sequelize.query(`SET lock_timeout = '3s'`); await queryInterface.addColumn('Sessions', 'viewCountNew', { type: Sequelize.BIGINT, allowNull: true, }); }, async down(queryInterface) { await queryInterface.removeColumn('Sessions', 'viewCountNew'); }, ``` 2. **App dual-write (deploy):** application writes both `viewCount` and `viewCountNew` on every insert/update. Keep reading from the old column for now. 3. **Backfill** the new column from the old one, in batches for large tables, then verify: ```javascript async up(queryInterface) { let affected; do { const [, meta] = await queryInterface.sequelize.query(` WITH batch AS ( SELECT "id" FROM "Sessions" WHERE "viewCountNew" IS NULL AND "viewCount" IS NOT NULL LIMIT 10000 ) UPDATE "Sessions" s SET "viewCountNew" = s."viewCount"::bigint FROM batch WHERE s."id" = batch."id" `); affected = meta.rowCount; } while (affected > 0); }, async down() { // no-op: backfilled values are harmless; rolling back Migration 1 will drop the column entirely }, ``` Final verification (run manually or as a guard): `SELECT count(*) FROM "Sessions" WHERE "viewCountNew" IS NULL` must be `0` (assuming `viewCount` itself is `NOT NULL`). 4. **Migration — finalize the new column as `NOT NULL`** using Playbook A's CHECK + VALIDATE + SET NOT NULL + DROP CHECK on `viewCountNew` (same rule: the SET NOT NULL goes in a separate, later release, gated on the `convalidated = t` check from the prior migration). 5. **Migration — drop `NOT NULL` on the old column** (only if `viewCount` was `NOT NULL`). This **must** run **before** the cutover deploy: once the app stops writing `viewCount`, any new `INSERT` that omits it would violate the old `NOT NULL` and fail. `DROP NOT NULL` is metadata-only (instant, no scan): ```javascript async up(queryInterface) { await queryInterface.sequelize.query(`SET lock_timeout = '3s'`); await queryInterface.sequelize.query(` ALTER TABLE "Sessions" ALTER COLUMN "viewCount" DROP NOT NULL `); }, async down() { // No-op on purpose: re-adding NOT NULL would require the full CHECK + VALIDATE // + SET NOT NULL dance (a bare SET NOT NULL here would full-scan under ACCESS // EXCLUSIVE). The old column is being retired anyway, so leave it nullable. }, ``` 6. **App cutover (deploy):** application reads **and** writes only `viewCountNew`. The old `viewCount` is now unused and nullable, so inserts that omit it succeed. 7. **Migration — rename the old column out of the way** (do not drop immediately; keep it for one release as a safety net): ```javascript async up(queryInterface) { await queryInterface.renameColumn('Sessions', 'viewCount', 'viewCountOld'); }, async down(queryInterface) { await queryInterface.renameColumn('Sessions', 'viewCountOld', 'viewCount'); }, ``` A later cleanup migration can drop `viewCountOld` once the rename has shipped safely. ## Quick checklist before writing either migration type - [ ] Strong-lock statements are preceded by `SET lock_timeout = '3s'`. - [ ] `SET NOT NULL` is only ever run **after** a validated `CHECK (col IS NOT NULL)` exists (separate prior migration). - [ ] Type changes use add-column + dual-write + backfill + rename — never `ALTER COLUMN ... TYPE` in place on a populated table. - [ ] On a type change, if the old column was `NOT NULL`, its `NOT NULL` is dropped **before** the app stops writing it (otherwise inserts that omit it fail). - [ ] App-side deploys (guard / dual-write / cutover) are sequenced relative to migrations as described. - [ ] Both `up` and `down` are implemented.