Backup Strategy

Primary database (Postgres)

Layer                  | Cadence                  | Retention | Storage
Point-in-time recovery | Continuous WAL archiving | 7 days    | Managed by provider (Neon / RDS / Cloud SQL)
Snapshots              | Nightly 03:00 SGT        | 30 days   | Provider-native snapshots
Weekly snapshots       | Sunday 03:00 SGT         | 12 weeks  | Provider-native snapshots
Logical dumps          | Nightly 03:30 SGT        | 90 days   | Separate S3 bucket, cross-region replication
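
The retention windows in the table can be expressed as a small lookup, which is handy when auditing whether an old backup should still exist. This is a minimal sketch: the layer keys (`pitr`, `nightly_snapshot`, and so on) and the `is_expired` helper are hypothetical names chosen for illustration, not identifiers from the real system.

```python
from datetime import datetime, timedelta, timezone

# Retention windows copied from the table. The dictionary keys are
# illustrative names (assumptions), not identifiers used by the provider.
RETENTION = {
    "pitr": timedelta(days=7),
    "nightly_snapshot": timedelta(days=30),
    "weekly_snapshot": timedelta(weeks=12),
    "logical_dump": timedelta(days=90),
}


def is_expired(layer: str, created_at: datetime, now: datetime) -> bool:
    """Return True once a backup is older than its layer's retention window."""
    return now - created_at > RETENTION[layer]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
month_old = now - timedelta(days=31)
print(is_expired("nightly_snapshot", month_old, now))  # True: past 30 days
print(is_expired("weekly_snapshot", month_old, now))   # False: within 12 weeks
```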

Logical dump details

  • Taken with pg_dump --format=custom --compress=9 --no-owner --no-privileges
  • Encrypted with a customer-managed KMS key (BACKUP_KMS_KEY_ID env)
  • File name: breezycorp_<YYYY-MM-DD>.dump
  • S3 bucket: breezycorp-backups-<region>, lifecycle policy auto-expires objects at 90 days
  • Bucket is write-once for the backup IAM role: under the principle of least privilege the role can add objects but not delete them, so a compromised API key cannot delete backups
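
The naming and flag conventions above could be assembled as follows. This is a sketch under stated assumptions rather than the actual backup job: the helper names, the example region, placing the dump at the bucket root, and the use of S3 server-side encryption (SSE-KMS) are all assumptions; the source only says that dumps are encrypted with the key identified by BACKUP_KMS_KEY_ID.

```python
from datetime import date


def dump_file_name(day: date) -> str:
    """File name in the documented breezycorp_<YYYY-MM-DD>.dump form."""
    return f"breezycorp_{day.isoformat()}.dump"


def bucket_name(region: str) -> str:
    """Bucket name in the documented breezycorp-backups-<region> form."""
    return f"breezycorp-backups-{region}"


def pg_dump_command(database_url: str, out_path: str) -> list[str]:
    """Argument list for pg_dump using the flags listed above."""
    return [
        "pg_dump",
        "--format=custom",
        "--compress=9",
        "--no-owner",
        "--no-privileges",
        f"--file={out_path}",
        f"--dbname={database_url}",
    ]


def upload_params(region: str, day: date, kms_key_id: str) -> dict:
    """Keyword arguments in the shape of an S3 PutObject request.

    Assumption: encryption is done with SSE-KMS and the object key is the
    bare file name. kms_key_id would come from the BACKUP_KMS_KEY_ID env var.
    """
    return {
        "Bucket": bucket_name(region),
        "Key": dump_file_name(day),
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }


day = date(2024, 1, 15)
print(dump_file_name(day))            # breezycorp_2024-01-15.dump
print(bucket_name("ap-southeast-1"))  # example region, not from the source
print(" ".join(pg_dump_command("postgres://localhost/app", "/tmp/" + dump_file_name(day))))
```

Building the command as a list (rather than a shell string) keeps the connection URL and file path from being reinterpreted by a shell if the list is later passed to `subprocess.run`.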

Why both snapshots and logical dumps?

Snapshots are the fastest path for operational recovery (the restore drill measures about 15 minutes for our data volume). Logical dumps give us a portable, provider-independent archive that we can restore onto a different Postgres engine if the provider fails catastrophically or we migrate.

S3 (file uploads)

  • Versioning: enabled on the primary bucket
  • Replication: cross-region to a DR bucket, enabled via S3 Replication Configuration
  • Lifecycle: uploads tagged retention=5y via the retention policy (see docs/retention-policy.md); audit-linked files retained 7 years

Configuration backups

Not every operational artefact lives in Postgres or S3. These need explicit backup:

  • Staff user password hashes — included in the logical dump
  • Client templates (xlsx files) — stored in S3 with versioning
  • Secrets (Vault / AWS Secrets Manager) — managed by the secrets provider; rotation history kept per runbooks/secret-rotation.md
  • Infrastructure-as-code — committed in git; backup = git remote mirror

Restore procedure

See runbooks/restore-drill.md for the step-by-step procedure and the quarterly drill checklist.

What's deliberately NOT backed up

  • pg-boss job tables — jobs are idempotent; lost jobs are retried from the domain state. Restoring jobs mid-flight would produce duplicates.
  • Session tokens (staff_sessions) — hashed opaque tokens; forcing re-login after a restore is safer than preserving them
  • Audit debounce cache (in-memory) — rebuilds on the fly
  • Prometheus metrics history — stored separately by the metrics backend
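
For the two Postgres-resident items in this list, one way the exclusion could be expressed at dump time is with pg_dump's `--exclude-table-data` option, which keeps table definitions but skips their rows. This is a sketch, not a description of the real job: the source does not say how the exclusion is implemented, the `pgboss` schema name is an assumption (it is the pg-boss default), and `exclusion_flags` is a hypothetical helper. The in-memory cache and Prometheus history live outside Postgres and need no flags.

```python
def exclusion_flags(
    job_schema: str = "pgboss",             # assumption: pg-boss default schema
    session_table: str = "staff_sessions",  # table name from the list above
) -> list[str]:
    """Extra pg_dump arguments that skip row data for excluded tables."""
    return [
        # Pattern "<schema>.*" matches every table in the job schema.
        f"--exclude-table-data={job_schema}.*",
        f"--exclude-table-data={session_table}",
    ]


print(exclusion_flags())
```

Skipping only the data, rather than the tables themselves, means a restored database still has the empty tables in place, so staff simply log in again and the job queue starts clean.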

Monitoring

  • Backup success/failure alerts go to #spade-ops (Slack) and PagerDuty
  • Weekly health check: verifies restorability using the procedure in docs/runbooks/restore-drill.md
  • Missed-backup alert threshold: a single missed run pages the on-call engineer
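
The "one missed run" threshold can be sketched as a comparison between the last successful dump and the most recent scheduled run. The one-hour grace period, the function names, and the choice to key the check on the 03:30 SGT logical dump are assumptions made for illustration; the source does not describe how the alert is computed.

```python
from datetime import datetime, time, timedelta, timezone

# SGT is UTC+8 with no daylight saving, so a fixed offset is sufficient.
SGT = timezone(timedelta(hours=8), "SGT")
DUMP_TIME = time(3, 30)  # nightly logical dump schedule


def last_scheduled_run(now: datetime) -> datetime:
    """Most recent 03:30 SGT slot at or before `now` (timezone-aware)."""
    local = now.astimezone(SGT)
    scheduled = datetime.combine(local.date(), DUMP_TIME, tzinfo=SGT)
    if scheduled > local:
        scheduled -= timedelta(days=1)
    return scheduled


def should_page(
    last_success: datetime,
    now: datetime,
    grace: timedelta = timedelta(hours=1),  # assumed allowance for run time
) -> bool:
    """True if the latest run whose grace period has passed did not succeed."""
    due = last_scheduled_run(now - grace)
    return last_success < due


now = datetime(2024, 6, 2, 5, 0, tzinfo=SGT)
print(should_page(datetime(2024, 6, 1, 3, 40, tzinfo=SGT), now))  # True
print(should_page(datetime(2024, 6, 2, 3, 40, tzinfo=SGT), now))  # False
```

Subtracting the grace period before looking up the scheduled slot avoids paging while tonight's run may still be in progress.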

Internal use only — BreezyCorp