On-Call Rotation & Alerting
Rotation
- Primary on-call: 1 engineer, 1-week rotation, handover Mondays 09:00 SGT
- Secondary on-call: 1 engineer, same cadence, covers primary unavailability
- Escalation: Engineering lead → CTO for P0 incidents
- Hours: 24/7 for P0/P1, business hours only for P2/P3
Severity levels
| Level | Definition | Response time |
|---|---|---|
| P0 | Data loss, security breach, full outage | 5 min acknowledge · page immediately |
| P1 | Partial outage, SLO burn rate > 10×, blocked client workflow | 15 min · page during business hours, Slack after-hours |
| P2 | Degraded performance, SLO burn rate > 2×, non-blocking bugs | 1 hour · Slack only |
| P3 | Minor bugs, cosmetic issues | Next business day |
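The burn-rate multipliers in the table compare the observed error ratio against the error budget. As a rough sketch, the `slo_burn_rate{window="1h"}` series used by the alerts could be produced by a recording rule like the one below. The 99.9% availability target (error budget 0.001) and the group name are assumptions for illustration; the authoritative targets are in docs/slo.md.

```yaml
groups:
  - name: spade-slo-burn-rate        # hypothetical group name
    rules:
      - record: slo_burn_rate
        # burn rate = observed error ratio / error budget
        # ASSUMPTION: 99.9% SLO, so the budget is 1 - 0.999 = 0.001
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) / 0.001
        labels:
          window: "1h"
```

Under that assumed target, a burn rate of 10× corresponds to a 1% error ratio over the window, and 2× corresponds to 0.2%.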
Alert routing
- #spade-ops (Slack) — all warnings, all P2/P3, passive observation
- PagerDuty — P0/P1 only; rotates to primary → secondary → Engineering lead
- Email to platform@spade — daily digest of warnings for trend review
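The routing above could be expressed roughly as follows, assuming Alertmanager sits between Prometheus and the notification channels. Receiver names, the 09:00 to 18:00 business-hours window, and the placeholder secrets are assumptions, not values from this repository; the `location` field needs a reasonably recent Alertmanager release. The daily email digest is not covered here, since Alertmanager has no digest mode.

```yaml
route:
  receiver: slack-spade-ops            # default lane: warnings, P2/P3
  group_by: [alertname]
  routes:
    # P0: page at any hour
    - matchers:
        - 'severity = "critical"'
      receiver: pagerduty-oncall
    # P1: page inside business hours, then keep evaluating sibling routes
    - matchers:
        - 'severity = "high"'
      receiver: pagerduty-oncall
      active_time_intervals: [business-hours]
      continue: true
    # P1: Slack outside business hours (muted while paging is active)
    - matchers:
        - 'severity = "high"'
      receiver: slack-spade-ops
      mute_time_intervals: [business-hours]

receivers:
  - name: pagerduty-oncall             # hypothetical receiver name
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder
  - name: slack-spade-ops              # hypothetical receiver name
    slack_configs:
      - channel: "#spade-ops"
        api_url: "<slack-webhook-url>"               # placeholder
        send_resolved: true

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"        # ASSUMPTION: window not defined in this doc
            end_time: "18:00"
        location: "Asia/Singapore"     # SGT
```

The primary → secondary → Engineering lead escalation is not something Alertmanager handles; it would live in the PagerDuty escalation policy.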
Alerting rules
These are the Prometheus / Grafana alert definitions. The thresholds derive from docs/slo.md.
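The definitions in the following sections are rule fragments. If they are loaded by Prometheus directly, each fragment would sit under a rule group, as in this minimal sketch. The group name is hypothetical, and one P0 rule is repeated here only to show the nesting.

```yaml
groups:
  - name: spade-p0                     # hypothetical group name
    rules:
      - alert: API_Down
        expr: up{job="spade-api"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Spade API is down"
          runbook: "docs/runbooks/graceful-restart.md"
```

A file shaped like this can be syntax-checked with `promtool check rules <file>` before it is deployed.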
P0 alerts (page immediately)
```yaml
- alert: API_Down
  expr: up{job="spade-api"} == 0
  for: 2m
  labels: { severity: critical }
  annotations:
    summary: "Spade API is down"
    runbook: "docs/runbooks/graceful-restart.md"
- alert: Database_Unreachable
  expr: spade_db_ready == 0
  for: 1m
  labels: { severity: critical }
  annotations:
    summary: "Postgres unreachable from API"
    runbook: "docs/runbooks/restore-drill.md"
- alert: Tenant_Leak_Suspected
  expr: increase(spade_cross_tenant_read_unexpected_total[5m]) > 0
  labels: { severity: critical }
  annotations:
    summary: "Unexpected cross-tenant read detected"
    runbook: "docs/runbooks/tenant-leak-suspected.md"
- alert: Backup_Missed
  # Fires once the last successful backup is more than 28 hours old
  expr: time() - spade_last_backup_timestamp > 28 * 3600
  labels: { severity: critical }
  annotations:
    summary: "Nightly backup missed"
    runbook: "docs/backup-strategy.md"
```
P1 alerts (page during business hours)
```yaml
- alert: High_Error_Rate
  # sum() on both sides: a per-series division would only pair 5xx series
  # with themselves, so the ratio would never reflect the overall error rate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 10m
  labels: { severity: high }
  annotations:
    summary: "API 5xx rate > 5% for 10 min"
- alert: SLO_Burn_Rate_High
  expr: slo_burn_rate{window="1h"} > 10
  for: 15m
  labels: { severity: high }
  annotations:
    summary: "Error budget burning 10x faster than target"
    runbook: "docs/slo.md#error-budget-policy"
- alert: Outbox_Dead_Lettered
  expr: spade_outbox_dead_letter_count > 0
  labels: { severity: high }
  annotations:
    summary: "Outbox events have been dead-lettered"
    runbook: "docs/runbooks/outbox-stuck.md"
- alert: Readiness_Probe_Failing
  expr: probe_success{job="readiness"} == 0
  for: 3m
  labels: { severity: high }
  annotations:
    summary: "API readiness probe failing"
    runbook: "docs/runbooks/graceful-restart.md"
```
P2 alerts (Slack only)
```yaml
- alert: OCR_Failure_Rate
  expr: rate(spade_ocr_failures[15m]) / rate(spade_ocr_total[15m]) > 0.10
  for: 15m
  labels: { severity: warning }
  annotations:
    summary: "OCR failure rate > 10% over 15 min"
    runbook: "docs/runbooks/ocr-failing.md"
- alert: DB_Connection_Pool_Saturation
  # Aggregate per instance so both sides share a label set; assumes
  # postgres_exporter metrics, where pg_stat_activity_count is split by datname/state
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "Postgres connection pool > 80% saturated"
    runbook: "docs/capacity.md"
- alert: Outbox_Backlog
  expr: spade_outbox_pending > 100
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "Outbox backlog > 100 events"
    runbook: "docs/runbooks/outbox-stuck.md"
- alert: Validation_Latency_High
  expr: histogram_quantile(0.95, rate(spade_validation_duration_seconds_bucket[5m])) > 30
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "Validation run P95 > 30s"
    runbook: "docs/slo.md"
```
Incident response process
First 5 minutes (P0/P1)
- Acknowledge the page on PagerDuty to stop the re-page loop
- Open a #incident-<date>-<short> Slack channel and invite primary + secondary + engineering lead
- Pin the incident commander (usually the on-call)
- State the observation in the channel (no diagnosis yet)
- Open the runbook linked in the alert
Next 15 minutes
- Declare severity and notify stakeholders via the severity's notification lane
- Start the timeline — a pinned message in the incident channel, updated as events unfold
- Follow the runbook — if the runbook doesn't match reality, note why in the timeline
- Mitigate first, fix second — rollback, scale, toggle a feature flag; the clean fix can wait
Mitigation → resolution
- Confirm recovery via the alert clearing AND a manual smoke test
- All-clear in #spade-ops with the recovery time
- Schedule the post-mortem within 72 hours: blameless, published in docs/post-mortems/
Post-mortem template
TODO: create docs/templates/postmortem.md. Structure:
- Summary (1 paragraph)
- Impact (clients affected, duration, SLO burn)
- Timeline (from first detection to all-clear)
- Root cause (technical + process)
- What went well
- What didn't
- Action items (with owners + due dates)
On-call handover checklist
Every Monday at 09:00 SGT, primary-outgoing → primary-incoming runs through:
- [ ] Any open incidents / ongoing investigations
- [ ] Any alerts fired in the past week (resolved or not)
- [ ] Any runbook updates the incoming on-call should know about
- [ ] Any known maintenance windows coming up
- [ ] Any dependency announcements (provider status, upcoming rotations)
Contact escalation
Primary on-call → (5 min no-ack) → Secondary on-call
Secondary → (5 min no-ack) → Engineering lead
Engineering lead → CTO (P0 only)
Phone numbers and Slack handles live in the internal wiki (not committed here).
Related
- docs/slo.md: the SLOs driving these alerts
- docs/runbooks/: per-incident procedures
- docs/capacity.md: sizing that informs the thresholds
- docs/pii-inventory.md: what to protect during an incident