Service Level Objectives

BreezyCorp's SLOs define what "working" means in production. They drive the readiness probe (3F.5), alerting rules (3K.7), and the error-budget policy below.

Availability

Service	Target	Measurement
API `/portal/`, `/ops/`, `/admin/*`	99.5% monthly	Excludes planned maintenance windows (max 2h/month, announced ≥ 48h in advance)
Worker job throughput	99.0% monthly	Counts jobs that eventually complete within 4× their expected runtime
Magic-link delivery	99.0% of requests	Email accepted by SMTP within 5 min
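
As an illustration of the worker throughput definition, the following is a minimal sketch of how the SLI could be computed from job records. The `JobRecord` shape and all function names are hypothetical and not taken from the codebase; treating the 4× boundary as inclusive and an empty window as compliant are also assumptions.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface JobRecord {
  expectedRuntimeMs: number;
  // Actual wall-clock runtime, or null if the job never completed.
  actualRuntimeMs: number | null;
}

// A job counts as "good" if it completed within 4x its expected runtime.
const RUNTIME_MULTIPLIER = 4;

function isGoodJob(job: JobRecord): boolean {
  return (
    job.actualRuntimeMs !== null &&
    job.actualRuntimeMs <= RUNTIME_MULTIPLIER * job.expectedRuntimeMs
  );
}

// Fraction of good jobs in the window (assumption: an empty window is compliant).
function throughputSli(jobs: JobRecord[]): number {
  if (jobs.length === 0) return 1;
  return jobs.filter(isGoodJob).length / jobs.length;
}

const sampleJobs: JobRecord[] = [
  { expectedRuntimeMs: 1000, actualRuntimeMs: 900 },
  { expectedRuntimeMs: 1000, actualRuntimeMs: 3900 },
  { expectedRuntimeMs: 1000, actualRuntimeMs: 4100 }, // slower than 4x expected
  { expectedRuntimeMs: 1000, actualRuntimeMs: null }, // never completed
];

const sli = throughputSli(sampleJobs);
console.log(sli >= 0.99 ? "within SLO" : "below SLO");
```

Jobs that never complete count against the SLI, which matches the "eventually complete" wording in the table.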

Error budget: a 99.5% availability target permits 3h 36m of downtime per 30-day month (0.5% of 720 h). See the error-budget policy below.
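
The budget follows directly from the target and the window length. Below is a minimal sketch of that arithmetic; the function names are illustrative and not part of any existing module.

```typescript
// Permitted downtime, in minutes, for an availability target over a window.
// target is a fraction (0.995 for 99.5%); windowDays is the window length in days.
function errorBudgetMinutes(target: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - target) * windowMinutes;
}

// Formats a minute count as "Xh Ym", rounding to the nearest whole minute.
function formatMinutes(totalMinutes: number): string {
  const rounded = Math.round(totalMinutes);
  return `${Math.floor(rounded / 60)}h ${rounded % 60}m`;
}

// Budget for the API availability target over a 30-day month.
console.log(formatMinutes(errorBudgetMinutes(0.995, 30)));
// Budget for the 99.0% targets over the same window.
console.log(formatMinutes(errorBudgetMinutes(0.99, 30)));
```

Rounding before formatting avoids floating-point noise from `1 - target`.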

Latency

Measured at the ALB and in the per-route histograms from @fastify/metrics.

Operation	P50	P95	P99
Reads (GET)	< 150 ms	< 500 ms	< 1500 ms
Writes (POST/PUT)	< 400 ms	< 2000 ms	< 5000 ms
File presign	< 200 ms	< 500 ms	< 1000 ms
OCR pipeline (Google Vision + Claude)	< 8 s	< 15 s (upload → extraction row)	< 60 s
Validation run (worker)	< 10 s	< 30 s	< 60 s

Latency SLOs exclude anything that requires the payroll engine (Infotech) to respond — those are measured separately as provider SLOs.
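
To make the latency table concrete, here is a minimal sketch that checks a set of latency samples against the targets. It computes nearest-rank percentiles over raw samples, which is an assumption made for illustration: the production measurement uses histograms, where percentiles are estimated from bucket counts. All names are hypothetical.

```typescript
interface LatencyTargets {
  p50: number;
  p95: number;
  p99: number;
}

// Targets in milliseconds, copied from the latency table.
const TARGETS_MS = {
  read: { p50: 150, p95: 500, p99: 1500 },
  write: { p50: 400, p95: 2000, p99: 5000 },
  presign: { p50: 200, p95: 500, p99: 1000 },
  ocr: { p50: 8000, p95: 15000, p99: 60000 },
  validation: { p50: 10000, p95: 30000, p99: 60000 },
};

// Nearest-rank percentile over raw samples; p is a percentage in (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// True when every percentile is strictly below its target.
function meetsTargets(samples: number[], t: LatencyTargets): boolean {
  return (
    percentile(samples, 50) < t.p50 &&
    percentile(samples, 95) < t.p95 &&
    percentile(samples, 99) < t.p99
  );
}

// 18 fast requests plus two slow ones, in milliseconds.
const latencySamples: number[] = [...Array(18).fill(100), 400, 1200];

console.log(meetsTargets(latencySamples, TARGETS_MS.read)); // true
console.log(meetsTargets(latencySamples, TARGETS_MS.presign)); // false: P99 is 1200 ms
```

Requests that wait on the payroll engine would need to be filtered out before this check, since the SLOs exclude them.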

Data Durability

Metric	Target
RPO (Recovery Point Objective)	≤ 5 min — via Postgres point-in-time recovery
RTO (Recovery Time Objective)	≤ 1 h — rehearsed in the quarterly restore drill (see `runbooks/restore-drill.md`)
Audit event retention	7 years (never purged by retention job)
File retention	5 years from cycle archive (per IRAS) — see `docs/retention-policy.md`
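
The two retention rows can be read as a purge-eligibility rule. The sketch below is an assumption about how a retention job might apply them, not a description of the actual job; `docs/retention-policy.md` remains the authority, and the type and function names are hypothetical.

```typescript
type RecordKind = "audit_event" | "file";

const FILE_RETENTION_YEARS = 5;

// Returns true when a retention job may purge the record.
// cycleArchivedAt is when the record's cycle was archived; it is ignored for audit events.
function isPurgeable(kind: RecordKind, cycleArchivedAt: Date, now: Date): boolean {
  // Audit events are never purged by the retention job.
  if (kind === "audit_event") return false;
  // Files become eligible once 5 years have passed since the cycle archive (UTC).
  const expiry = new Date(cycleArchivedAt.getTime());
  expiry.setUTCFullYear(expiry.getUTCFullYear() + FILE_RETENTION_YEARS);
  return now.getTime() >= expiry.getTime();
}

const archivedAt = new Date("2020-01-15T00:00:00Z");
console.log(isPurgeable("file", archivedAt, new Date("2025-01-14T00:00:00Z"))); // false
console.log(isPurgeable("file", archivedAt, new Date("2025-01-15T00:00:00Z"))); // true
console.log(isPurgeable("audit_event", archivedAt, new Date("2040-01-01T00:00:00Z"))); // false
```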

Error Budget Policy

When an SLO consumes its error budget faster than the current window allows, the response escalates with the burn rate:

Burn rate > 2× for 1 hour — notify on-call (#spade-ops Slack)
Burn rate > 10× for 15 minutes — page on-call (PagerDuty)
Budget exhausted — all non-critical feature work pauses until the SLO is back in budget. The team focuses on reliability until then.

Critical security fixes and SOP compliance bugs are exempt from the feature freeze — they are reliability work by definition.
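
The thresholds above can be sketched as a small decision function. This assumes the common definition of burn rate (observed error rate divided by the error rate the SLO allows, so 1× spends the budget exactly over the window); the policy itself does not spell out the formula. The input shape, names, and the precedence of the checks are illustrative assumptions.

```typescript
type Action = "none" | "notify" | "page" | "freeze";

// Hypothetical inputs; in practice these would come from the metrics backend.
interface BurnInputs {
  target: number; // SLO target as a fraction, e.g. 0.995
  errorRate15m: number; // fraction of bad events over the last 15 minutes
  errorRate1h: number; // fraction of bad events over the last hour
  budgetConsumed: number; // share of the window's budget already spent (1 = exhausted)
}

// Burn rate: observed error rate relative to the rate the SLO allows.
function burnRate(errorRate: number, target: number): number {
  return errorRate / (1 - target);
}

// Checks the most severe condition first (assumed precedence).
function decide(i: BurnInputs): Action {
  if (i.budgetConsumed >= 1) return "freeze"; // pause non-critical feature work
  if (burnRate(i.errorRate15m, i.target) > 10) return "page"; // PagerDuty
  if (burnRate(i.errorRate1h, i.target) > 2) return "notify"; // Slack
  return "none";
}

const base = { target: 0.995, budgetConsumed: 0.3 };
console.log(decide({ ...base, errorRate15m: 0.001, errorRate1h: 0.001 })); // "none"
console.log(decide({ ...base, errorRate15m: 0.02, errorRate1h: 0.02 })); // "notify"
console.log(decide({ ...base, errorRate15m: 0.1, errorRate1h: 0.03 })); // "page"
console.log(decide({ ...base, budgetConsumed: 1.2, errorRate15m: 0, errorRate1h: 0 })); // "freeze"
```

The exemption for critical security fixes and SOP compliance bugs is a process rule and is deliberately not modeled here.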

Non-goals

These are intentionally not covered by SLOs:

OCR provider accuracy (tracked separately as a model-quality metric)
Email provider (SMTP) uptime — we depend on it but don't own it; the outbox poller buffers during outages
Infotech payroll engine response time — treated as an external dependency
Swagger UI at /docs — not user-facing; no SLO

Owner & review

Owner: Platform team
Review cadence: quarterly, alongside the restore drill
Changes: SLO adjustments require a retrospective covering the affected period and sign-off from the Engineering lead
