Backup Strategy
Primary database (Postgres)
| Layer | Cadence | Retention | Storage |
|---|---|---|---|
| Point-in-time recovery | Continuous WAL archiving | 7 days | Managed by provider (Neon / RDS / Cloud SQL) |
| Snapshots | Nightly 03:00 SGT | 30 days | Provider-native snapshots |
| Weekly snapshots | Sunday 03:00 SGT | 12 weeks | Provider-native snapshots |
| Logical dumps | Nightly 03:30 SGT | 90 days | Separate S3 bucket, cross-region replication |
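
The retention windows in the table can also be read as data. The following is a minimal sketch, not the actual implementation: the layer keys (`pitr`, `nightly_snapshot`, and so on) and the helpers `expiry_cutoff` and `is_expired` are hypothetical names introduced here only to show how each window translates into a cutoff date.

```python
from datetime import datetime, timedelta, timezone

# Retention windows copied from the table above.
# The dictionary keys are hypothetical names chosen for this sketch.
RETENTION = {
    "pitr": timedelta(days=7),
    "nightly_snapshot": timedelta(days=30),
    "weekly_snapshot": timedelta(weeks=12),
    "logical_dump": timedelta(days=90),
}

SGT = timezone(timedelta(hours=8))  # Singapore Time is UTC+8 with no DST


def expiry_cutoff(layer: str, now: datetime) -> datetime:
    """Return the oldest timestamp still inside the layer's retention window."""
    return now - RETENTION[layer]


def is_expired(layer: str, taken_at: datetime, now: datetime) -> bool:
    """True if a backup taken at `taken_at` has aged out of its window."""
    return taken_at < expiry_cutoff(layer, now)
```

For example, `is_expired("nightly_snapshot", taken_at, now)` is true for any nightly snapshot more than 30 days older than `now`.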
Logical dump details
- Taken with `pg_dump --format=custom --compress=9 --no-owner --no-privileges`
- Encrypted with a customer-managed KMS key (`BACKUP_KMS_KEY_ID` env)
- File name: `breezycorp_<YYYY-MM-DD>.dump`
- S3 bucket: `breezycorp-backups-<region>`, lifecycle policy auto-expires objects at 90 days
- Bucket is write-once for the backup IAM role; least privilege prevents a compromised API key from deleting backups
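
The naming and command conventions above could be assembled roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the source does not say how the KMS encryption is applied, so `sse_kms_extra_args` assumes S3 server-side encryption with the key taken from `BACKUP_KMS_KEY_ID` (the dictionary uses the `ExtraArgs` keys that boto3 uploads accept).

```python
from datetime import date
from typing import Mapping

# Flags copied from the bullet list above.
PG_DUMP_FLAGS = ["--format=custom", "--compress=9", "--no-owner", "--no-privileges"]


def dump_file_name(day: date) -> str:
    """File name in the documented breezycorp_<YYYY-MM-DD>.dump form."""
    return f"breezycorp_{day.isoformat()}.dump"


def bucket_name(region: str) -> str:
    """Bucket name in the documented breezycorp-backups-<region> form."""
    return f"breezycorp-backups-{region}"


def pg_dump_argv(database_url: str, out_path: str) -> list[str]:
    """Build the pg_dump command line; --file and --dbname are standard options."""
    return ["pg_dump", *PG_DUMP_FLAGS, f"--file={out_path}", f"--dbname={database_url}"]


def sse_kms_extra_args(env: Mapping[str, str]) -> dict[str, str]:
    """Assumption: encryption is S3 server-side encryption with the KMS key.

    Raises KeyError if BACKUP_KMS_KEY_ID is missing, so an unencrypted
    upload cannot happen silently.
    """
    return {
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": env["BACKUP_KMS_KEY_ID"],
    }
```

Passing the argument list to `subprocess.run(..., check=True)` rather than a shell string avoids quoting problems in the connection URL.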
Why both snapshots and logical dumps?
Snapshots are the fastest path for operational recovery (the restore drill measured ~15 min for our data volume). Logical dumps give us a portable, provider-independent archive that we can restore onto a different Postgres engine if the provider fails catastrophically or we migrate.
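
To illustrate the portability point, a custom-format dump can be loaded into any reachable Postgres with `pg_restore`. The sketch below is an assumption, not the documented restore procedure: `pg_restore_argv` and the default of four parallel jobs are hypothetical. It passes `--no-owner` again at restore time because, as far as the pg_dump documentation describes, that flag is ignored when writing archive formats such as custom.

```python
def pg_restore_argv(dump_path: str, target_url: str, jobs: int = 4) -> list[str]:
    """Build a pg_restore command line for a custom-format dump.

    --jobs enables parallel restore, which pg_restore supports for the
    custom and directory formats.
    """
    return [
        "pg_restore",
        "--no-owner",
        "--no-privileges",
        f"--jobs={jobs}",
        f"--dbname={target_url}",
        dump_path,
    ]
```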
S3 (file uploads)
- Versioning: enabled on the primary bucket
- Replication: cross-region to a DR bucket, enabled via S3 Replication Configuration
- Lifecycle: uploads tagged `retention=5y` via the retention policy (see `docs/retention-policy.md`); audit-linked files retained 7 years
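
As a rough illustration of the two retention classes, the sketch below computes a retain-until date per upload. The source only names the `retention=5y` tag; the `7y` tag value for audit-linked files and the helper names are assumptions made for this example.

```python
from datetime import date


def retention_years(audit_linked: bool) -> int:
    """5 years by default, 7 years for audit-linked files."""
    return 7 if audit_linked else 5


def retention_tag(audit_linked: bool) -> dict[str, str]:
    """Tag applied at upload time. The '7y' value is an assumption."""
    return {"retention": f"{retention_years(audit_linked)}y"}


def retain_until(uploaded: date, audit_linked: bool) -> date:
    """Date after which the object may expire."""
    target_year = uploaded.year + retention_years(audit_linked)
    try:
        return uploaded.replace(year=target_year)
    except ValueError:
        # Uploaded on Feb 29 and the target year is not a leap year.
        return uploaded.replace(year=target_year, day=28)
```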
Configuration backups
Not every operational artefact lives in Postgres or S3. These need explicit backup:
- Staff user password hashes: included in the logical dump
- Client templates (xlsx files): stored in S3 with versioning
- Secrets (Vault / AWS Secrets Manager): managed by the secrets provider; rotation history kept per `runbooks/secret-rotation.md`
- Infrastructure-as-code: committed in git; backup = git remote mirror
Restore procedure
See runbooks/restore-drill.md for the step-by-step procedure and the quarterly drill checklist.
What's deliberately NOT backed up
- pg-boss job tables: jobs are idempotent; lost jobs are retried from the domain state. Restoring jobs mid-flight would produce duplicates.
- Session tokens (`staff_sessions`): hashed opaque tokens; forcing re-login after a restore is safer than preserving them
- Audit debounce cache (in-memory): rebuilds on the fly
- Prometheus metrics history: stored separately by the metrics backend
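
One way the table-level exclusions might be expressed for the logical dump is sketched below. The source does not say which mechanism is actually used, so treat this as an assumption: it presumes pg-boss lives in its default `pgboss` schema and that `staff_sessions` is a table, and `with_exclusions` is a hypothetical helper. `--exclude-table-data` is a standard pg_dump option that keeps the table definitions but skips the rows, so the application still finds its tables after a restore.

```python
# Assumptions: pg-boss uses its default "pgboss" schema, and staff_sessions
# is a table in the dumped database.
EXCLUDE_FLAGS = [
    "--exclude-table-data=pgboss.*",
    "--exclude-table-data=staff_sessions",
]


def with_exclusions(argv: list[str]) -> list[str]:
    """Return a copy of a pg_dump argv with the exclusion flags added.

    Flags are inserted right after the program name so that any trailing
    positional arguments stay last.
    """
    return [argv[0], *EXCLUDE_FLAGS, *argv[1:]]
```

Provider-native snapshots capture the whole volume, so exclusions like these would apply only to the logical dump.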
Monitoring
- Backup success/failure alerts go to #spade-ops (Slack) and PagerDuty
- Weekly health check: `docs/runbooks/restore-drill.md` verifies restorability
- Missed backup alert threshold: 1 missed run pages on-call
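
The missed-run threshold can be sketched as a small check against the time of the last successful backup. This is illustrative only: the function names, the one-hour grace period, and the 24-hour interval (taken from the nightly cadence) are assumptions, not the actual alerting rule.

```python
from datetime import datetime, timedelta

PAGE_THRESHOLD = 1  # from the list above: 1 missed run pages on-call


def missed_runs(
    last_success: datetime,
    now: datetime,
    interval: timedelta = timedelta(days=1),
    grace: timedelta = timedelta(hours=1),
) -> int:
    """Count whole intervals that have elapsed, beyond the grace period,
    since the last successful backup."""
    overdue = now - last_success - grace
    if overdue <= timedelta(0):
        return 0
    return overdue // interval


def should_page(last_success: datetime, now: datetime) -> bool:
    """True once the number of missed runs reaches the paging threshold."""
    return missed_runs(last_success, now) >= PAGE_THRESHOLD
```

The grace period keeps a run that is merely slow from paging on-call before it has had a chance to finish.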