On-Call Rotation & Alerting

Rotation

  • Primary on-call: 1 engineer, 1-week rotation, handover Mondays 09:00 SGT
  • Secondary on-call: 1 engineer, same cadence, covers primary unavailability
  • Escalation: Engineering lead → CTO for P0 incidents
  • Hours: 24/7 for P0/P1, business hours only for P2/P3

Severity levels

| Level | Definition | Response time |
|-------|------------|---------------|
| P0 | Data loss, security breach, full outage | 5 min acknowledge · page immediately |
| P1 | Partial outage, SLO burn rate > 10×, blocked client workflow | 15 min · page during business hours, Slack after-hours |
| P2 | Degraded performance, SLO burn rate > 2×, non-blocking bugs | 1 hour · Slack only |
| P3 | Minor bugs, cosmetic issues | Next business day |

Alert routing

  • #spade-ops (Slack) — all warnings, all P2/P3, passive observation
  • PagerDuty — P0/P1 only; rotates to primary → secondary → Engineering lead
  • Email to platform@spade — daily digest of warnings for trend review
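
The routing above could be expressed in Alertmanager roughly as follows. This is only a sketch: it assumes Alertmanager (v0.25 or later, for `location`) sits between Prometheus and the channels, that business hours are 09:00-18:00 Monday to Friday Singapore time, and that the severity labels match the alert rules in this document. The receiver names and the placeholder credentials are hypothetical, and the daily email digest is left out.

```yaml
# Sketch of an Alertmanager config; receiver names, hours, and keys are assumptions.
route:
  receiver: slack-spade-ops          # default: warnings and anything unmatched
  group_by: [alertname]
  routes:
    # P0: page at any hour.
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall
    # P1: page only inside business hours...
    - matchers: ['severity="high"']
      receiver: pagerduty-oncall
      active_time_intervals: [business-hours]
      continue: true                 # keep matching so the Slack route below also sees it
    # ...and post to Slack outside them.
    - matchers: ['severity="high"']
      receiver: slack-spade-ops
      mute_time_intervals: [business-hours]

time_intervals:
  - name: business-hours             # assumed 09:00-18:00 SGT, Monday to Friday
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'
        location: 'Asia/Singapore'

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
  - name: slack-spade-ops
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#spade-ops'
```

The `continue: true` on the first P1 route matters: without it, a P1 alert raised after hours would match the muted PagerDuty route and never reach Slack.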

Alerting rules

These are the Prometheus / Grafana alert definitions. The thresholds derive from docs/slo.md.

P0 alerts (page immediately)

yaml
- alert: API_Down
  expr: up{job="spade-api"} == 0
  for: 2m
  labels: { severity: critical }
  annotations:
    summary: "Spade API is down"
    runbook: "docs/runbooks/graceful-restart.md"

- alert: Database_Unreachable
  expr: spade_db_ready == 0
  for: 1m
  labels: { severity: critical }
  annotations:
    summary: "Postgres unreachable from API"
    runbook: "docs/runbooks/restore-drill.md"

- alert: Tenant_Leak_Suspected
  expr: increase(spade_cross_tenant_read_unexpected_total[5m]) > 0
  labels: { severity: critical }
  annotations:
    summary: "Unexpected cross-tenant read detected"
    runbook: "docs/runbooks/tenant-leak-suspected.md"

- alert: Backup_Missed
  expr: time() - spade_last_backup_timestamp > 28 * 3600
  labels: { severity: critical }
  annotations:
    summary: "Nightly backup missed"
    runbook: "docs/backup-strategy.md"
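
Rules like these can be exercised offline with `promtool test rules` before they page anyone. A minimal sketch for API_Down, assuming the rules are saved in a hypothetical file `alerts.yml` (wrapped in a `groups:` block) and using a made-up `instance` label:

```yaml
# api_down_test.yml (hypothetical name); run with: promtool test rules api_down_test.yml
rule_files:
  - alerts.yml            # assumed location of the rules above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The API target reports down for six consecutive samples (0m to 5m).
      - series: 'up{job="spade-api", instance="api-1"}'
        values: '0+0x5'
    alert_rule_test:
      # At 1m the condition holds but `for: 2m` has not elapsed: nothing fires yet.
      - eval_time: 1m
        alertname: API_Down
        exp_alerts: []
      # At 3m the alert should be firing with the rule's labels and annotations.
      - eval_time: 3m
        alertname: API_Down
        exp_alerts:
          - exp_labels:
              severity: critical
              job: spade-api
              instance: api-1
            exp_annotations:
              summary: "Spade API is down"
              runbook: "docs/runbooks/graceful-restart.md"
```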

P1 alerts (page during business hours)

yaml
- alert: High_Error_Rate
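  # Aggregate both sides: a per-series division only matches 5xx series against themselves.
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05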
  for: 10m
  labels: { severity: high }
  annotations:
    summary: "API 5xx rate > 5% for 10 min"

- alert: SLO_Burn_Rate_High
  expr: slo_burn_rate{window="1h"} > 10
  for: 15m
  labels: { severity: high }
  annotations:
    summary: "Error budget burning 10x faster than target"
    runbook: "docs/slo.md#error-budget-policy"

- alert: Outbox_Dead_Lettered
  expr: spade_outbox_dead_letter_count > 0
  labels: { severity: high }
  annotations:
    summary: "Outbox events have been dead-lettered"
    runbook: "docs/runbooks/outbox-stuck.md"

- alert: Readiness_Probe_Failing
  expr: probe_success{job="readiness"} == 0
  for: 3m
  labels: { severity: high }
  annotations:
    summary: "API readiness probe failing"
    runbook: "docs/runbooks/graceful-restart.md"
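
`slo_burn_rate` is not a built-in metric, so something has to produce it. One possible recording rule is sketched below. It assumes an availability SLO of 99.9% (the real target lives in docs/slo.md) and reuses `http_requests_total` from High_Error_Rate. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target).

```yaml
# Hypothetical recording rule; the 0.999 target is an assumption, not taken from docs/slo.md.
- record: slo_burn_rate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    )
    / (1 - 0.999)
  labels:
    window: "1h"
```

Under that assumed target, a sustained 1% error ratio gives 0.01 / 0.001, a burn rate of about 10, which is the P1 threshold.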

P2 alerts (Slack only)

yaml
- alert: OCR_Failure_Rate
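  # Summed on both sides so the ratio still evaluates if the two counters carry different labels.
  expr: sum(rate(spade_ocr_failures[15m])) / sum(rate(spade_ocr_total[15m])) > 0.10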
  for: 15m
  labels: { severity: warning }
  annotations:
    summary: "OCR failure rate > 10% over 15 min"
    runbook: "docs/runbooks/ocr-failing.md"

- alert: DB_Connection_Pool_Saturation
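  # Assumes postgres_exporter metrics: pg_stat_activity_count is split by datname/state,
  # so both sides are reduced to one series per instance before dividing.
  expr: sum by (instance) (pg_stat_activity_count) / sum by (instance) (pg_settings_max_connections) > 0.8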
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "Postgres connection pool > 80% saturated"
    runbook: "docs/capacity.md"

- alert: Outbox_Backlog
  expr: spade_outbox_pending > 100
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "Outbox backlog > 100 events"
    runbook: "docs/runbooks/outbox-stuck.md"

- alert: Validation_Latency_High
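  # Sum buckets by le so the P95 is computed across all instances rather than per series.
  expr: histogram_quantile(0.95, sum by (le) (rate(spade_validation_duration_seconds_bucket[5m]))) > 30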
  for: 10m
  labels: { severity: warning }
  annotations:
    summary: "Validation run P95 > 30s"
    runbook: "docs/slo.md"
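
The snippets above are bare rule lists. In a Prometheus rule file they would sit under `groups`, for example one group per severity. A sketch with a hypothetical file name and group names, showing one rule per group:

```yaml
# alerts.yml (hypothetical file name); group names are illustrative.
groups:
  - name: spade-p0
    rules:
      - alert: API_Down
        expr: up{job="spade-api"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Spade API is down"
          runbook: "docs/runbooks/graceful-restart.md"
  - name: spade-p2
    rules:
      - alert: Outbox_Backlog
        expr: spade_outbox_pending > 100
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Outbox backlog > 100 events"
          runbook: "docs/runbooks/outbox-stuck.md"
```

Running `promtool check rules alerts.yml` validates the file's syntax and expressions before it is loaded.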

Incident response process

First 5 minutes (P0/P1)

  1. Acknowledge the page on PagerDuty to stop the re-page loop
  2. Open a #incident-<date>-<short> Slack channel and invite the primary, secondary, and engineering lead
  3. Name the incident commander (usually the on-call) in a pinned message
  4. State the observation in the channel (no diagnosis yet)
  5. Open the runbook linked in the alert

Next 15 minutes

  1. Declare severity and notify stakeholders via the severity's notification lane
  2. Start the timeline — a pinned message in the incident channel, updated as events unfold
  3. Follow the runbook — if the runbook doesn't match reality, note why in the timeline
  4. Mitigate first, fix second — rollback, scale, toggle a feature flag; the clean fix can wait

Mitigation → resolution

  1. Confirm recovery via the alert clearing AND a manual smoke test
  2. Post the all-clear in #spade-ops with the recovery time
  3. Schedule the post-mortem — within 72 hours, blameless, published in docs/post-mortems/

Post-mortem template

TODO: create docs/templates/postmortem.md. Structure:

  • Summary (1 paragraph)
  • Impact (clients affected, duration, SLO burn)
  • Timeline (from first detection to all-clear)
  • Root cause (technical + process)
  • What went well
  • What didn't
  • Action items (with owners + due dates)

On-call handover checklist

Every Monday at 09:00 SGT, the outgoing primary walks the incoming primary through:

  • [ ] Any open incidents / ongoing investigations
  • [ ] Any alerts fired in the past week (resolved or not)
  • [ ] Any runbook updates the incoming on-call should know about
  • [ ] Any known maintenance windows coming up
  • [ ] Any dependency announcements (provider status, upcoming rotations)

Contact escalation

Primary on-call → (5 min no-ack) → Secondary on-call
Secondary → (5 min no-ack) → Engineering lead
Engineering lead → CTO (P0 only)

Phone numbers and Slack handles live in the internal wiki (not committed here).

Related documents

  • docs/slo.md — the SLOs driving these alerts
  • docs/runbooks/ — per-incident procedures
  • docs/capacity.md — sizing that informs the thresholds
  • docs/pii-inventory.md — what to protect during an incident

Internal use only — BreezyCorp