On-Call Rotation & Alerting

Rotation

Primary on-call: 1 engineer, 1-week rotation, handover Mondays 09:00 SGT
Secondary on-call: 1 engineer, same cadence, covers primary unavailability
Escalation: Engineering lead → CTO for P0 incidents
Hours: 24/7 for P0/P1 (P1 pages only during business hours and goes to Slack after-hours), business hours only for P2/P3

Severity levels

Level	Definition	Response time	Notification
P0	Data loss, security breach, full outage	5 min to acknowledge	Page immediately
P1	Partial outage, SLO burn rate > 10×, blocked client workflow	15 min	Page during business hours, Slack after-hours
P2	Degraded performance, SLO burn rate > 2×, non-blocking bugs	1 hour	Slack only
P3	Minor bugs, cosmetic issues	Next business day	Slack only
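
The burn-rate thresholds in the table can be read as multiples of the error budget. The worked arithmetic here is only a sketch: the 99.9% availability target and the 30-day budget window are assumptions for illustration, and the real values live in docs/slo.md.

```latex
\[
\text{burn rate} = \frac{\text{observed error ratio}}{1 - \text{SLO target}}
\]
\[
\text{P1 threshold: } 10 \times (1 - 0.999) = 0.01 \quad (1\%\ \text{errors}), \qquad
\text{budget exhausted in } \frac{30\ \text{days}}{10} = 3\ \text{days}
\]
\[
\text{P2 threshold: } 2 \times (1 - 0.999) = 0.002 \quad (0.2\%\ \text{errors}), \qquad
\text{budget exhausted in } \frac{30\ \text{days}}{2} = 15\ \text{days}
\]
```

Under those assumptions, a sustained 10× burn uses up the whole month's budget in about three days, which is why it pages, while a 2× burn leaves roughly two weeks to react.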

Alert routing

#spade-ops (Slack) — all warnings, all P2/P3, passive observation
PagerDuty — P0/P1 only; escalates primary → secondary → Engineering lead
Email to platform@spade — daily digest of warnings for trend review
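
One way to express this routing is an Alertmanager configuration keyed on the `severity` label used by the alert rules in this document. This is a minimal sketch, not the deployed config: the receiver names, the 09:00 to 18:00 weekday definition of business hours, and the placeholder credentials are all assumptions. The primary → secondary → Engineering lead escalation would be handled by the PagerDuty escalation policy rather than by Alertmanager, and the daily email digest is not covered here.

```yaml
route:
  receiver: slack-ops              # default: warnings and anything unmatched go to Slack
  group_by: [alertname]
  routes:
    # P0 (severity=critical): page at any hour.
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall
    # P1 (severity=high): page only inside business hours...
    - matchers: ['severity="high"']
      receiver: pagerduty-oncall
      active_time_intervals: [business-hours]
      continue: true               # keep matching so the Slack route below is also evaluated
    # ...and post to Slack outside them.
    - matchers: ['severity="high"']
      receiver: slack-ops
      mute_time_intervals: [business-hours]

time_intervals:
  - name: business-hours           # assumed definition of "business hours"
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'
        location: 'Asia/Singapore'

receivers:
  - name: pagerduty-oncall         # hypothetical receiver name
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
  - name: slack-ops                # hypothetical receiver name
    slack_configs:
      - channel: '#spade-ops'
        api_url: '<slack-webhook-url>'
```

The two `severity="high"` routes split on the same time interval: the first is active only during business hours, the second is muted during them, so a P1 alert reaches exactly one of the two receivers at any given time.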

Alerting rules

These are the Prometheus / Grafana alert definitions. The thresholds derive from docs/slo.md.

P0 alerts (page immediately)

yaml

- alert: API_Down
  expr: up{job="spade-api"} == 0
  for: 2m
  labels: { severity: critical }
  annotations:
    summary: "Spade API is down"
    runbook: "docs/runbooks/graceful-restart.md"

- alert: Database_Unreachable
  expr: spade_db_ready == 0
  for: 1m
  labels: { severity: critical }
  annotations:
    summary: "Postgres unreachable from API"
    runbook: "docs/runbooks/restore-drill.md"

- alert: Tenant_Leak_Suspected
  expr: increase(spade_cross_tenant_read_unexpected_total[5m]) > 0
  labels: { severity: critical }
  annotations:
    summary: "Unexpected cross-tenant read detected"
    runbook: "docs/runbooks/tenant-leak-suspected.md"

- alert: Backup_Missed
  expr: time() - spade_last_backup_timestamp > 28 * 3600
  labels: { severity: critical }
  annotations:
    summary: "Nightly backup missed"
    runbook: "docs/backup-strategy.md"
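
The entries above are list items, so Prometheus will only load them once they sit under a `groups:` entry in a rule file. A minimal sketch of that wrapper, using the first P0 rule unchanged; the group name is hypothetical:

```yaml
groups:
  - name: spade-p0                 # hypothetical group name
    rules:
      - alert: API_Down
        expr: up{job="spade-api"} == 0
        for: 2m                    # condition must hold for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Spade API is down"
          runbook: "docs/runbooks/graceful-restart.md"
```

Rules without a `for:` clause, such as Tenant_Leak_Suspected and Backup_Missed, fire on the first evaluation where the expression is true. Running `promtool check rules` on the file validates its syntax before it is loaded.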

P1 alerts (page during business hours)

yaml

- alert: High_Error_Rate
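  # Aggregate both sides before dividing: without sum(), the ratio only pairs
  # series with identical labels (including status), so it never yields the 5xx share.
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05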
  for: 10m
  labels: { severity: high }
  annotations:
    summary: "API 5xx rate > 5% for 10 min"

- alert: SLO_Burn_Rate_High
  expr: slo_burn_rate{window="1h"} > 10
  for: 15m
  labels: { severity: high }
  annotations:
    summary: "Error budget burning 10x faster than target"
    runbook: "docs/slo.md#error-budget-policy"

- alert: Outbox_Dead_Lettered
  expr: spade_outbox_dead_letter_count > 0
  labels: { severity: high }
  annotations:
    summary: "Outbox events have been dead-lettered"
    runbook: "docs/runbooks/outbox-stuck.md"

- alert: Readiness_Probe_Failing
  expr: probe_success{job="readiness"} == 0
  for: 3m
  labels: { severity: high }
  annotations:
    summary: "API readiness probe failing"
    runbook: "docs/runbooks/graceful-restart.md"
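
SLO_Burn_Rate_High reads `slo_burn_rate{window="1h"}`, which is not a metric Prometheus provides on its own, so it presumably comes from a recording rule. A sketch of what such a rule might look like, assuming an availability SLO measured on `http_requests_total` and a 99.9% target (both assumptions; the real definition belongs with docs/slo.md):

```yaml
groups:
  - name: spade-slo-recording      # hypothetical group name
    rules:
      - record: slo_burn_rate
        labels:
          window: "1h"
        # burn rate = observed error ratio / error budget (1 - SLO target).
        # 0.999 is an assumed target, not a value taken from docs/slo.md.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999)
```

With this definition a value of 10 means 1% of requests failed over the last hour, matching the P1 threshold in the alert above.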

P2 alerts (Slack only)

yaml

- alert: OCR_Failure_Rate
  expr: rate(spade_ocr_failures[15m]) / rate(spade_ocr_total[15m]) > 0.10
  for: 15m
  labels: { severity: warning }
  annotations:
    summary: "OCR failure rate > 10% over 15 min"
    runbook: "docs/runbooks/ocr-failing.md"

- alert: DB_Connection_Pool_Saturation
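  # Assuming postgres_exporter metrics: pg_stat_activity_count carries datname/state
  # labels, so sum per instance before dividing by the unlabelled max_connections setting.
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8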
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "Postgres connection pool > 80% saturated"
    runbook: "docs/capacity.md"

- alert: Outbox_Backlog
  expr: spade_outbox_pending > 100
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "Outbox backlog > 100 events"
    runbook: "docs/runbooks/outbox-stuck.md"

- alert: Validation_Latency_High
  expr: histogram_quantile(0.95, rate(spade_validation_duration_seconds_bucket[5m])) > 30
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "Validation run P95 > 30s"
    runbook: "docs/slo.md"
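
Thresholds and `for:` durations like these can be checked with a promtool unit test before they reach production. The sketch below exercises Outbox_Backlog and assumes the rules in this document are saved under a `groups:` entry in a rule file named `alerts.yml`; both file names are hypothetical.

```yaml
# alerts_test.yml (hypothetical file name)
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Backlog sits at 150 events (above the 100 threshold) for 15 minutes.
      - series: 'spade_outbox_pending'
        values: '150+0x15'
    alert_rule_test:
      # After 5 minutes the alert is still pending, so nothing should be firing.
      - eval_time: 5m
        alertname: Outbox_Backlog
      # After 12 minutes the 10m "for" window has elapsed and the alert fires.
      - eval_time: 12m
        alertname: Outbox_Backlog
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: "Outbox backlog > 100 events"
              runbook: "docs/runbooks/outbox-stuck.md"
```

The test runs with `promtool test rules alerts_test.yml`. Expected labels and annotations have to match the rule exactly, so a test like this also catches accidental edits to the summary text or runbook link.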

Incident response process

First 5 minutes (P0/P1)

Acknowledge the page on PagerDuty to stop the re-page loop
Open #incident-<date>-<short> Slack channel — invite primary + secondary + engineering lead
Name the incident commander (usually the primary on-call) and pin that message
State the observation in the channel (no diagnosis yet)
Open the runbook linked in the alert

Next 15 minutes

Declare severity and notify stakeholders via the severity's notification lane
Start the timeline — a pinned message in the incident channel, updated as events unfold
Follow the runbook — if the runbook doesn't match reality, note why in the timeline
Mitigate first, fix second — rollback, scale, toggle a feature flag; the clean fix can wait

Mitigation → resolution

Confirm recovery via the alert clearing AND a manual smoke test
Post the all-clear in #spade-ops with the recovery time
Schedule the post-mortem — within 72 hours, blameless, published in docs/post-mortems/

Post-mortem template

TODO: create docs/templates/postmortem.md. Structure:

Summary (1 paragraph)
Impact (clients affected, duration, SLO burn)
Timeline (from first detection to all-clear)
Root cause (technical + process)
What went well
What didn't
Action items (with owners + due dates)

On-call handover checklist

Every Monday at 09:00 SGT, the outgoing primary walks the incoming primary through:

[ ] Any open incidents / ongoing investigations
[ ] Any alerts fired in the past week (resolved or not)
[ ] Any runbook updates the incoming on-call should know about
[ ] Any known maintenance windows coming up
[ ] Any dependency announcements (provider status, upcoming rotations)

Contact escalation

Primary on-call → (5 min no-ack) → Secondary on-call
Secondary → (5 min no-ack) → Engineering lead
Engineering lead → CTO (P0 only)

Phone numbers and Slack handles live in the internal wiki (not committed here).

Related

docs/slo.md — the SLOs driving these alerts
docs/runbooks/ — per-incident procedures
docs/capacity.md — sizing that informs the thresholds
docs/pii-inventory.md — what to protect during an incident
