Service Level Objectives
BreezyCorp's SLOs define what "working" means in production. They drive the readiness probe (3F.5), alerting rules (3K.7), and the error-budget policy below.
Availability
| Service | Target | Measurement window |
|---|---|---|
| API /portal/*, /ops/*, /admin/* | 99.5% monthly | Excludes planned maintenance windows (max 2h/month, announced ≥ 48h in advance) |
| Worker job throughput | 99.0% monthly | Jobs that eventually complete within 4× their expected runtime |
| Magic-link delivery | 99.0% of requests | Email accepted by SMTP within 5 min |
Error budget: 99.5% availability = 3h 36m of permitted downtime per 30-day month (0.5% of 720 h). See the error-budget policy below.
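The budget arithmetic can be sketched as follows. This is a minimal illustration, not production code: the function names `errorBudgetMinutes` and `formatMinutes` are hypothetical, and planned maintenance windows are not subtracted from the window.

```typescript
// Minutes in one day; the window length is passed in as days (30 for a month here).
const MINUTES_PER_DAY = 24 * 60;

/**
 * Permitted downtime, in minutes, for an availability target over a window.
 * targetPercent is e.g. 99.5; windowDays is e.g. 30.
 * (Hypothetical helper, not part of any existing codebase.)
 */
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  if (targetPercent <= 0 || targetPercent > 100) {
    throw new RangeError("targetPercent must be in (0, 100]");
  }
  const windowMinutes = windowDays * MINUTES_PER_DAY;
  // Multiply before dividing so common targets (99.5, 99.0) stay exact.
  return (windowMinutes * (100 - targetPercent)) / 100;
}

/** Formats a minute count as "Xh Ym". */
function formatMinutes(totalMinutes: number): string {
  const rounded = Math.round(totalMinutes);
  return `${Math.floor(rounded / 60)}h ${rounded % 60}m`;
}

// 99.5% over 30 days: 0.5% of 43,200 minutes = 216 minutes.
const apiBudget = formatMinutes(errorBudgetMinutes(99.5, 30)); // "3h 36m"
```

The same helper gives 432 minutes (7h 12m) for the 99.0% targets over a 30-day window.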
Latency
Measured at the ALB and in the @fastify/metrics per-route histograms.
| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Reads (GET) | < 150 ms | < 500 ms | < 1500 ms |
| Writes (POST/PUT) | < 400 ms | < 2000 ms | < 5000 ms |
| File presign | < 200 ms | < 500 ms | < 1000 ms |
| OCR pipeline (Google Vision + Claude) | < 8 s | < 15 s (upload → extraction row) | < 60 s |
| Validation run (worker) | < 10 s | < 30 s | < 60 s |
Latency SLOs exclude anything that requires the payroll engine (Infotech) to respond — those are measured separately as provider SLOs.
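To make the table concrete, the sketch below checks a batch of raw latency samples against the percentile targets. It assumes raw per-request samples in milliseconds are available, whereas the histograms mentioned above only yield bucketed estimates. The names `LATENCY_TARGETS_MS`, `percentile`, and `latencyViolations` are hypothetical.

```typescript
type LatencyTarget = { p50: number; p95: number; p99: number }; // milliseconds

// Targets copied from the table above; the key names are illustrative.
const LATENCY_TARGETS_MS = {
  read: { p50: 150, p95: 500, p99: 1500 },
  write: { p50: 400, p95: 2000, p99: 5000 },
  filePresign: { p50: 200, p95: 500, p99: 1000 },
};

/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Multiply before dividing to avoid rounding surprises such as 0.07 * 100.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

/** Returns the percentiles whose observed value misses its target. */
function latencyViolations(samplesMs: number[], target: LatencyTarget): string[] {
  const observed: LatencyTarget = {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    p99: percentile(samplesMs, 99),
  };
  // Targets are strict upper bounds ("< 150 ms"), so equality counts as a miss.
  return (["p50", "p95", "p99"] as const).filter((k) => observed[k] >= target[k]);
}
```

For example, a batch in which 10% of reads take 600 ms passes the read P50 and P99 targets but misses P95, so `latencyViolations` returns only `"p95"`.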
Data Durability
| Metric | Target |
|---|---|
| RPO (Recovery Point Objective) | ≤ 5 min — via Postgres point-in-time recovery |
| RTO (Recovery Time Objective) | ≤ 1 h — rehearsed in the quarterly restore drill (see runbooks/restore-drill.md) |
| Audit event retention | 7 years (never purged by retention job) |
| File retention | 5 years from cycle archive (per IRAS) — see docs/retention-policy.md |
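The two retention rows can be read as a rule for the retention job. The sketch below is an assumption about how such a rule might look, not the actual job: the authoritative rules live in docs/retention-policy.md, and the names `RecordKind`, `fileRetentionExpiry`, and `isPurgeable` are hypothetical.

```typescript
type RecordKind = "file" | "auditEvent";

const FILE_RETENTION_YEARS = 5; // from the table: 5 years from cycle archive

/** Earliest instant at which a file may be purged (calendar years, in UTC). */
function fileRetentionExpiry(cycleArchivedAt: Date): Date {
  const expiry = new Date(cycleArchivedAt.getTime());
  // A 29 February archive date rolls over to 1 March in a non-leap target year.
  expiry.setUTCFullYear(expiry.getUTCFullYear() + FILE_RETENTION_YEARS);
  return expiry;
}

/** Whether the retention job may purge a record at time `now`. */
function isPurgeable(kind: RecordKind, cycleArchivedAt: Date, now: Date): boolean {
  // Audit events are never purged by the retention job, regardless of age.
  if (kind === "auditEvent") return false;
  return now.getTime() >= fileRetentionExpiry(cycleArchivedAt).getTime();
}
```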
Error Budget Policy
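When an SLO burns its error budget faster than the current window allows, the response escalates with the burn rate (1× consumes the budget exactly over the window):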
- Burn rate > 2× for 1 hour — notify on-call (#spade-ops Slack)
- Burn rate > 10× for 15 minutes — page on-call (PagerDuty)
- Budget exhausted — all non-critical feature work pauses until the SLO is back in budget. The team focuses on reliability until then.
Critical security fixes and SOP compliance bugs are exempt from the feature freeze — they are reliability work by definition.
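The escalation rules above can be sketched as a small decision function. Two assumptions are made here: burn rate is taken in its common sense (observed error ratio divided by the error ratio the SLO allows), and "for 1 hour" and "for 15 minutes" are approximated by the average burn rate over the trailing window. All names are hypothetical.

```typescript
type BudgetAction = "notify" | "page" | "freeze";

/** Burn rate = observed error ratio / error ratio the SLO allows (assumed definition). */
function burnRate(failed: number, total: number, targetPercent: number): number {
  if (total === 0) return 0; // no traffic, nothing burned
  const allowedPercent = 100 - targetPercent;
  if (allowedPercent <= 0) throw new RangeError("targetPercent must be below 100");
  return (failed * 100) / total / allowedPercent;
}

interface BurnSnapshot {
  rateLast15m: number; // average burn rate over the trailing 15 minutes
  rateLast1h: number; // average burn rate over the trailing hour
  budgetRemaining: number; // fraction of the window's budget left, 0..1
}

/** Maps a snapshot to the actions listed in the policy. */
function decideActions(s: BurnSnapshot): BudgetAction[] {
  const actions: BudgetAction[] = [];
  if (s.rateLast1h > 2) actions.push("notify"); // #spade-ops Slack
  if (s.rateLast15m > 10) actions.push("page"); // PagerDuty
  if (s.budgetRemaining <= 0) actions.push("freeze"); // pause non-critical feature work
  return actions;
}
```

With a 99.5% target, 60 failures out of 1,000 requests is a 6% error ratio against an allowed 0.5%, which is a burn rate of 12× and crosses the paging threshold if it holds for the 15-minute window.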
Non-goals
These are intentionally not covered by SLOs:
- OCR provider accuracy (tracked separately as a model-quality metric)
- Email provider (SMTP) uptime — we depend on it but don't own it; the outbox poller buffers during outages
- Infotech payroll engine response time — treated as an external dependency
- Swagger UI at /docs (not user-facing; no SLO)
Owner & review
- Owner: Platform team
- Review cadence: quarterly, alongside the restore drill
- Changes: SLO adjustments require a retrospective covering the affected period and sign-off from the Engineering lead