Service Level Objectives

BreezyCorp's SLOs define what "working" means in production. They drive the readiness probe (3F.5), alerting rules (3K.7), and the error-budget policy below.

Availability

| Service | Target | Measurement window |
| --- | --- | --- |
| API /portal/*, /ops/*, /admin/* | 99.5% monthly | Excludes planned maintenance windows (max 2h/month, announced ≥ 48h in advance) |
| Worker job throughput | 99.0% monthly | Jobs that eventually complete within 4× their expected runtime |
| Magic-link delivery | 99.0% | Of requests → email accepted by SMTP within 5 min |
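
As an illustration of the worker throughput target, the sketch below counts a job as "good" when it completes within 4× its expected runtime. The `Job` record and its field names are hypothetical, and two details are assumptions the table does not settle: the 4× bound is treated as inclusive, and an empty window is treated as fully compliant.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are illustrative only."""
    expected_s: float            # expected runtime in seconds
    actual_s: Optional[float]    # actual runtime, or None if the job never completed


def worker_throughput_sli(jobs: Sequence[Job], factor: float = 4.0) -> float:
    """Fraction of jobs that completed within `factor` x their expected runtime."""
    if not jobs:
        return 1.0  # assumption: no jobs in the window means no violations
    good = sum(
        1
        for job in jobs
        if job.actual_s is not None and job.actual_s <= factor * job.expected_s
    )
    return good / len(jobs)


def meets_target(jobs: Sequence[Job], target: float = 0.99) -> bool:
    """True when the window satisfies the 99.0% worker throughput target."""
    return worker_throughput_sli(jobs) >= target
```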

Error budget: 99.5% availability allows 0.5% × 720 h = 3h 36m of downtime per 30-day month. See the error-budget policy below.
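
The budget arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name is illustrative, and it ignores the planned-maintenance exclusion. For 99.5% over a 30-day window it gives 3 h 36 min.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime permitted by an availability target (a fraction) over a window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    return window * (1.0 - target)


# 99.5% over 30 days: 0.5% of 720 h, about 3 h 36 min
api_budget = error_budget(0.995)
# 99.0% over 30 days: 1% of 720 h, about 7 h 12 min
worker_budget = error_budget(0.99)
```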

Latency

Measured at the ALB / in @fastify/metrics per-route histograms.

| Operation | P50 | P95 | P99 |
| --- | --- | --- | --- |
| Reads (GET) | < 150 ms | < 500 ms | < 1500 ms |
| Writes (POST/PUT) | < 400 ms | < 2000 ms | < 5000 ms |
| File presign | < 200 ms | < 500 ms | < 1000 ms |
| OCR pipeline (Google Vision + Claude) | < 8 s | < 15 s (upload → extraction row) | < 60 s |
| Validation run (worker) | < 10 s | < 30 s | < 60 s |

Latency SLOs exclude anything that requires the payroll engine (Infotech) to respond — those are measured separately as provider SLOs.
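
To make the percentile targets concrete, here is a small sketch that checks raw latency samples against the Reads (GET) row. It uses the nearest-rank percentile on raw samples, which is an assumption made for illustration; production measurement relies on histograms, which only approximate percentiles from bucket counts. All names are hypothetical.

```python
from typing import Dict, Sequence

# Targets for Reads (GET) in milliseconds, keyed by percentile.
READ_TARGETS_MS: Dict[int, float] = {50: 150, 95: 500, 99: 1500}


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_latency(samples: Sequence[float], targets: Dict[int, float]) -> Dict[int, bool]:
    """Map each percentile to True when the observed value is strictly below its target."""
    return {p: percentile(samples, p) < limit for p, limit in targets.items()}
```

With 98 requests at 100 ms and two at 2000 ms, P50 and P95 pass while P99 fails, which shows why the tail percentiles are tracked separately from the median.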

Data Durability

| Metric | Target |
| --- | --- |
| RPO (Recovery Point Objective) | ≤ 5 min, via Postgres point-in-time recovery |
| RTO (Recovery Time Objective) | ≤ 1 h, rehearsed in the quarterly restore drill (see runbooks/restore-drill.md) |
| Audit event retention | 7 years (never purged by retention job) |
| File retention | 5 years from cycle archive (per IRAS); see docs/retention-policy.md |
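
The file retention rule can be read as a date comparison. The sketch below is not the retention job itself: the function names are hypothetical, and it assumes a file becomes eligible for purge on the fifth anniversary of the cycle archive date, with a 29 February archive date rolling forward to 1 March. docs/retention-policy.md remains the authority on both points. Audit events are never eligible, since the retention job does not purge them.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Same calendar day `years` later; 29 Feb rolls forward to 1 Mar in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only 29 Feb can fail here, when the target year is not a leap year.
        return date(d.year + years, 3, 1)


def file_purge_eligible(archived_on: date, today: date, years: int = 5) -> bool:
    """True once the assumed retention period since cycle archive has fully elapsed."""
    return today >= add_years(archived_on, years)
```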

Error Budget Policy

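Burn rate is the observed error rate divided by the rate the SLO allows; a sustained burn rate of 1× uses up exactly the whole budget by the end of the window. Responses escalate as an SLO consumes its error budget for the current window: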

  1. Burn rate > 2× for 1 hour — notify on-call (#spade-ops Slack)
  2. Burn rate > 10× for 15 minutes — page on-call (PagerDuty)
  3. Budget exhausted — all non-critical feature work pauses until the SLO is back in budget. The team focuses on reliability until then.

Critical security fixes and SOP compliance bugs are exempt from the feature freeze — they are reliability work by definition.
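
The escalation thresholds above can be sketched as a small decision function. It assumes the burn rates have already been averaged over their 1-hour and 15-minute windows, and that the three rules are evaluated independently, so more than one action can apply at once. The names are illustrative and not taken from the alerting rules in 3K.7.

```python
from enum import Enum
from typing import List


class Action(Enum):
    NOTIFY = "notify on-call in Slack"
    PAGE = "page on-call via PagerDuty"
    FREEZE = "pause non-critical feature work"


def burn_rate(error_ratio: float, target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1.0 - target
    if allowed <= 0.0:
        raise ValueError("a 100% target leaves no error budget")
    return error_ratio / allowed


def policy_actions(rate_1h: float, rate_15m: float, budget_remaining: float) -> List[Action]:
    """Every policy action triggered by the current burn rates and remaining budget.

    budget_remaining is the unused fraction of the window's budget (0.0 = exhausted).
    """
    actions: List[Action] = []
    if rate_1h > 2.0:
        actions.append(Action.NOTIFY)
    if rate_15m > 10.0:
        actions.append(Action.PAGE)
    if budget_remaining <= 0.0:
        actions.append(Action.FREEZE)
    return actions
```

For example, a 5% error ratio against the 99.5% API target is a burn rate of about 10×, a pace that would exhaust a 30-day budget in roughly three days.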

Non-goals

These are intentionally not covered by SLOs:

  • OCR provider accuracy (tracked separately as a model-quality metric)
  • Email provider (SMTP) uptime — we depend on it but don't own it; the outbox poller buffers during outages
  • Infotech payroll engine response time — treated as an external dependency
  • Swagger UI at /docs — not user-facing; no SLO

Owner & review

  • Owner: Platform team
  • Review cadence: quarterly, alongside the restore drill
  • Changes: SLO adjustments require a retrospective from the affected period + sign-off from Engineering lead

Internal use only — BreezyCorp