Runbook: Database Restore Drill

Cadence: quarterly (Q1/Q2/Q3/Q4), first week of the quarter. Owner: Platform on-call rotation. SLO target: RTO ≤ 1 hour (see docs/slo.md).

Purpose

"We have backups" is not the same as "we can restore." This drill proves the restore path works end-to-end against a real snapshot, in a scratch environment, measured against the RTO.

Pre-requisites

Latest nightly snapshot identifier (from the provider console)
A scratch Postgres instance (CI-style, not prod) with enough disk
Read access to the logical-dump S3 bucket
pg_restore available locally or in CI

Drill procedure

Run through these steps and record timings. The full drill should complete in ≤ 1 hour.

1. Provision scratch DB (5 min)

bash

# Either spin up a local postgres:17 container...
docker run -d --name spade-restore-drill \
  -e POSTGRES_USER=spade -e POSTGRES_PASSWORD=spade \
  -e POSTGRES_DB=spade_restore -p 55432:5432 postgres:17

# ...or `terraform apply` the scratch workspace for a cloud instance.

2. Fetch the latest logical dump (5 min)

bash

aws s3 cp "s3://breezycorp-backups-$REGION/breezycorp_$(date +%Y-%m-%d).dump" /tmp/

3. Restore (15 min)

bash

PGPASSWORD=spade pg_restore \
  --host=localhost --port=55432 --username=spade \
  --dbname=spade_restore --verbose --no-owner --no-privileges \
  /tmp/breezycorp_$(date +%Y-%m-%d).dump

4. Verify schema + migrations state (5 min)

bash

DATABASE_URL="postgresql://spade:spade@localhost:55432/spade_restore" \
  pnpm db:migrate:status

Expected output: all migrations applied, no pending. Any drift is a fail — open an incident.

5. Spot-check row counts (5 min)

sql

-- Total counts should match pre-restore production snapshot (±1% for nightly delta)
SELECT
  (SELECT COUNT(*) FROM clients) AS clients,
  (SELECT COUNT(*) FROM payroll_cycles) AS cycles,
  (SELECT COUNT(*) FROM submissions) AS submissions,
  (SELECT COUNT(*) FROM files) AS files,
  (SELECT COUNT(*) FROM audit_events) AS audits;

6. Spot-check data integrity (10 min)

Pick the most recent ARCHIVED cycle and walk the relations:

sql

-- Pick the newest archived cycle
SELECT id, client_id, cycle_month FROM payroll_cycles
WHERE overall_status = 'ARCHIVED'
ORDER BY closed_at DESC LIMIT 1;

Then verify the submission, approval round, and files all exist and match what the prod system reports for that cycle.

7. Boot the app against the restored DB (10 min)

bash

DATABASE_URL="postgresql://spade:spade@localhost:55432/spade_restore" \
  NODE_ENV=drill pnpm --filter @breezycorp/api dev

Hit /health/ready — expect 200 with db: true. Visit /ops/cycles (with a staff login from the restore) and confirm the list renders.

8. Tear down scratch DB (1 min)

bash

docker stop spade-restore-drill && docker rm spade-restore-drill
# or: terraform destroy -workspace=restore-drill

9. Record results (5 min)

Append a row to docs/restore-drill-log.md:

| Date | Operator | Snapshot date | Total time | Outcome | Notes |

Sign off the drill in the #spade-ops Slack channel so auditors can trace it.

Failure modes to watch for

pg_restore errors on extension missing — the restore DB must have the same extensions (pgcrypto, uuid-ossp). Pre-create them before the restore.
Row count mismatch > 1% — could indicate a silent backup corruption; escalate to the DB provider.
Missing rows in audit_events — audit retention is 7 years; if dumps aren't including them, the dump command is wrong.
migrate:status shows drift — the live DB has unmigrated schema. Fix production before the next drill.

When to run out-of-band

Before any major Postgres version upgrade
After changing the backup tooling (pg_dump flags, new S3 bucket, etc.)
After a near-miss recovery event
Before a major release that alters the data model significantly

Runbook: Database Restore Drill ​

Purpose ​

Pre-requisites ​

Drill procedure ​

1. Provision scratch DB (5 min) ​

2. Fetch the latest logical dump (5 min) ​

3. Restore (15 min) ​

4. Verify schema + migrations state (5 min) ​

5. Spot-check row counts (5 min) ​

6. Spot-check data integrity (10 min) ​

7. Boot the app against the restored DB (10 min) ​

8. Tear down scratch DB (1 min) ​

9. Record results (5 min) ​

Failure modes to watch for ​

When to run out-of-band ​

Runbook: Database Restore Drill

Purpose

Pre-requisites

Drill procedure

1. Provision scratch DB (5 min)

2. Fetch the latest logical dump (5 min)

3. Restore (15 min)

4. Verify schema + migrations state (5 min)

5. Spot-check row counts (5 min)

6. Spot-check data integrity (10 min)

7. Boot the app against the restored DB (10 min)

8. Tear down scratch DB (1 min)

9. Record results (5 min)

Failure modes to watch for

When to run out-of-band