Skip to content

Runbook: Database Restore Drill

Cadence: quarterly (Q1/Q2/Q3/Q4), first week of the quarter. Owner: Platform on-call rotation. SLO target: RTO ≤ 1 hour (see docs/slo.md).

Purpose

"We have backups" is not the same as "we can restore." This drill proves the restore path works end-to-end against a real snapshot, in a scratch environment, measured against the RTO.

Pre-requisites

  • Latest nightly snapshot identifier (from the provider console)
  • A scratch Postgres instance (CI-style, not prod) with enough disk
  • Read access to the logical-dump S3 bucket
  • pg_restore available locally or in CI

Drill procedure

Run through these steps and record timings. The full drill should complete in ≤ 1 hour.

1. Provision scratch DB (5 min)

bash
# Either spin up a local postgres:17 container...
docker run -d --name spade-restore-drill \
  -e POSTGRES_USER=spade -e POSTGRES_PASSWORD=spade \
  -e POSTGRES_DB=spade_restore -p 55432:5432 postgres:17

# ...or `terraform apply` the scratch workspace for a cloud instance.

2. Fetch the latest logical dump (5 min)

bash
aws s3 cp "s3://breezycorp-backups-$REGION/breezycorp_$(date +%Y-%m-%d).dump" /tmp/

3. Restore (15 min)

bash
PGPASSWORD=spade pg_restore \
  --host=localhost --port=55432 --username=spade \
  --dbname=spade_restore --verbose --no-owner --no-privileges \
  /tmp/breezycorp_$(date +%Y-%m-%d).dump

4. Verify schema + migrations state (5 min)

bash
DATABASE_URL="postgresql://spade:spade@localhost:55432/spade_restore" \
  pnpm db:migrate:status

Expected output: all migrations applied, no pending. Any drift is a fail — open an incident.

5. Spot-check row counts (5 min)

sql
-- Total counts should match pre-restore production snapshot (±1% for nightly delta)
SELECT
  (SELECT COUNT(*) FROM clients) AS clients,
  (SELECT COUNT(*) FROM payroll_cycles) AS cycles,
  (SELECT COUNT(*) FROM submissions) AS submissions,
  (SELECT COUNT(*) FROM files) AS files,
  (SELECT COUNT(*) FROM audit_events) AS audits;

6. Spot-check data integrity (10 min)

Pick the most recent ARCHIVED cycle and walk the relations:

sql
-- Pick the newest archived cycle
SELECT id, client_id, cycle_month FROM payroll_cycles
WHERE overall_status = 'ARCHIVED'
ORDER BY closed_at DESC LIMIT 1;

Then verify the submission, approval round, and files all exist and match what the prod system reports for that cycle.

7. Boot the app against the restored DB (10 min)

bash
DATABASE_URL="postgresql://spade:spade@localhost:55432/spade_restore" \
  NODE_ENV=drill pnpm --filter @breezycorp/api dev

Hit /health/ready — expect 200 with db: true. Visit /ops/cycles (with a staff login from the restore) and confirm the list renders.

8. Tear down scratch DB (1 min)

bash
docker stop spade-restore-drill && docker rm spade-restore-drill
# or: terraform destroy -workspace=restore-drill

9. Record results (5 min)

Append a row to docs/restore-drill-log.md:

| Date | Operator | Snapshot date | Total time | Outcome | Notes |

Sign off the drill in the #spade-ops Slack channel so auditors can trace it.

Failure modes to watch for

  • pg_restore errors on extension missing — the restore DB must have the same extensions (pgcrypto, uuid-ossp). Pre-create them before the restore.
  • Row count mismatch > 1% — could indicate a silent backup corruption; escalate to the DB provider.
  • Missing rows in audit_events — audit retention is 7 years; if dumps aren't including them, the dump command is wrong.
  • migrate:status shows drift — the live DB has unmigrated schema. Fix production before the next drill.

When to run out-of-band

  • Before any major Postgres version upgrade
  • After changing the backup tooling (pg_dump flags, new S3 bucket, etc.)
  • After a near-miss recovery event
  • Before a major release that alters the data model significantly

Internal use only — BreezyCorp