Backup Strategy

Primary database (Postgres)

Layer	Cadence	Retention	Storage
Point-in-time recovery	Continuous WAL archiving	7 days	Managed by provider (Neon / RDS / Cloud SQL)
Snapshots	Nightly 03:00 SGT	30 days	Provider-native snapshots
Weekly snapshots	Sunday 03:00 SGT	12 weeks	Provider-native snapshots
Logical dumps	Nightly 03:30 SGT	90 days	Separate S3 bucket, cross-region replication
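
As a rough illustration of how the retention windows in the table interact, the sketch below lists which layers should still hold a backup for a given calendar day. The function name and the inclusive boundary (a backup exactly N days old still counts) are assumptions made for this example, and the SGT schedule times are ignored.

```python
from datetime import date

# Retention windows copied from the table above, expressed in days.
RETENTION_DAYS = {
    "point_in_time_recovery": 7,
    "nightly_snapshot": 30,
    "weekly_snapshot": 12 * 7,
    "logical_dump": 90,
}

SUNDAY = 6  # date.weekday() numbers Monday as 0 and Sunday as 6


def layers_with_backup_for(day: date, today: date) -> list[str]:
    """Return the layers expected to still hold a backup taken on `day`."""
    age = (today - day).days
    if age < 0:
        raise ValueError("day is in the future")
    layers = []
    for layer, days in RETENTION_DAYS.items():
        if age > days:
            continue  # outside this layer's retention window
        if layer == "weekly_snapshot" and day.weekday() != SUNDAY:
            continue  # weekly snapshots are only taken on Sundays
        layers.append(layer)
    return layers
```

For a day more than 30 days back, only a Sunday snapshot or a logical dump would be available, which is one practical consequence of the table.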

Logical dump details

Taken with pg_dump --format=custom --compress=9 --no-owner --no-privileges
Encrypted with a customer-managed KMS key (BACKUP_KMS_KEY_ID env)
File name: breezycorp_<YYYY-MM-DD>.dump
S3 bucket: breezycorp-backups-<region>, lifecycle policy auto-expires objects at 90 days
Bucket is write-once for the backup IAM role: under least privilege the role can add new objects but cannot delete existing ones, so a compromised API key cannot delete backups
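
The details above can be sketched as a few small helpers that build the dump file name, the bucket name, and the pg_dump command line. The helper names, the connection URL, and the output path are hypothetical; the KMS encryption and upload steps are not shown.

```python
from datetime import date


def dump_filename(day: date) -> str:
    """File name in the documented breezycorp_<YYYY-MM-DD>.dump form."""
    return f"breezycorp_{day.isoformat()}.dump"


def bucket_name(region: str) -> str:
    """Bucket name in the documented breezycorp-backups-<region> form."""
    return f"breezycorp-backups-{region}"


def pg_dump_argv(database_url: str, out_path: str) -> list[str]:
    """Build the pg_dump command with the flags listed above."""
    return [
        "pg_dump",
        "--format=custom",
        "--compress=9",
        "--no-owner",
        "--no-privileges",
        f"--file={out_path}",
        database_url,  # pg_dump accepts a connection URI as the dbname argument
    ]
```

The returned list can be passed to subprocess.run(..., check=True) so that a non-zero pg_dump exit status raises instead of silently producing a partial dump.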

Why both snapshots and logical dumps?

Snapshots are fastest for operational recovery (restore drill measures ~15 min for our data volume). Logical dumps give us a portable, provider-independent archive we can restore onto a different Postgres engine if the provider fails catastrophically or we migrate.
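
To illustrate the portability point, a custom-format dump can be loaded into any reachable Postgres instance with pg_restore. The sketch below only builds the command; the function name, target URL, and default job count are assumptions, and runbooks/restore-drill.md remains the authoritative procedure.

```python
def pg_restore_argv(target_url: str, dump_path: str, jobs: int = 4) -> list[str]:
    """Build a pg_restore command for a custom-format dump."""
    return [
        "pg_restore",
        "--no-owner",          # do not try to recreate the original role ownership
        "--no-privileges",     # skip GRANT/REVOKE statements from the source
        f"--jobs={jobs}",      # parallel restore, supported for custom-format dumps
        f"--dbname={target_url}",
        dump_path,
    ]
```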

S3 (file uploads)

Versioning: enabled on the primary bucket
Replication: cross-region to a DR bucket, enabled via S3 Replication Configuration
Lifecycle: uploads tagged retention=5y via the retention policy (see docs/retention-policy.md); audit-linked files retained 7 years
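
As a sketch of how the retention tag might be attached at upload time: S3 expects object tags as a URL-encoded query string, so a helper can produce that value. The function name is hypothetical, and the retention=7y value for audit-linked files is an assumption; the source only states that such files are kept for 7 years.

```python
from urllib.parse import urlencode


def upload_tagging(audit_linked: bool) -> str:
    """Return the URL-encoded tag string for an uploaded file."""
    # 5 years is the documented default; the 7y tag value is assumed here.
    years = 7 if audit_linked else 5
    return urlencode({"retention": f"{years}y"})
```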

Configuration backups

Not every operational artefact is covered by the database and upload backups by default. Each item below has an explicit backup path:

Staff user password hashes — included in the logical dump
Client templates (xlsx files) — stored in S3 with versioning
Secrets (Vault / AWS Secrets Manager) — managed by the secrets provider; rotation history kept per runbooks/secret-rotation.md
Infrastructure-as-code — committed in git; backup = git remote mirror

Restore procedure

See runbooks/restore-drill.md for the step-by-step procedure and the quarterly drill checklist.

What's deliberately NOT backed up

pg-boss job tables — jobs are idempotent; lost jobs are retried from the domain state. Restoring jobs mid-flight would produce duplicates.
Session tokens (staff_sessions) — hashed opaque tokens; forcing re-login after a restore is safer than preserving them
Audit debounce cache (in-memory) — rebuilds on the fly
Prometheus metrics history — stored separately by the metrics backend
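
For the logical dump, one way to apply the exclusions above is pg_dump's --exclude-table-data option, which keeps the table definitions but skips the rows. This is a sketch only: the pgboss schema name (the pg-boss default) and the unqualified staff_sessions pattern are assumptions about this deployment, and provider snapshots and PITR would still capture these tables.

```python
# Assumed patterns; adjust to the schema names actually in use.
EXCLUDED_TABLE_DATA = ["pgboss.*", "staff_sessions"]


def exclusion_flags(patterns: list[str]) -> list[str]:
    """Turn table patterns into pg_dump --exclude-table-data flags."""
    return [f"--exclude-table-data={pattern}" for pattern in patterns]
```

Keeping the definitions while dropping the rows means a restored database starts with empty job and session tables, which matches the intent of retrying jobs from domain state and forcing re-login.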

Monitoring

Backup success/failure alerts go to #spade-ops (Slack) and PagerDuty
Weekly health check: verifies restorability by following docs/runbooks/restore-drill.md
Missed backup alert threshold: 1 missed run pages on-call
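
The missed-run threshold can be sketched as a check against the time of the last successful backup. The function name, the one-hour grace period, and the nightly cadence default are assumptions for illustration; only the threshold of 1 comes from the list above.

```python
from datetime import datetime, timedelta

PAGE_THRESHOLD = 1  # from the list above: 1 missed run pages on-call


def missed_runs(
    last_success: datetime,
    now: datetime,
    cadence: timedelta = timedelta(days=1),
    grace: timedelta = timedelta(hours=1),
) -> int:
    """Count scheduled runs that should have succeeded since last_success."""
    overdue = now - last_success - grace
    if overdue <= timedelta(0):
        return 0
    # Integer division of timedeltas gives the number of whole cadences elapsed.
    return overdue // cadence


def should_page(last_success: datetime, now: datetime) -> bool:
    """True once the number of missed runs reaches the paging threshold."""
    return missed_runs(last_success, now) >= PAGE_THRESHOLD
```

The grace period keeps a run that is still in progress from being counted as missed.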
