Capacity Planning
Initial sizing derived from the Phase 3K.4 load test (tests/load/month-end-burst.js). Update this doc after each run of the load test in staging.
Reference workload
- 50 clients with payroll cycles
- Month-end burst: 80% of monthly submissions arrive within a 2-hour window on the 1st of the month
- Files per cycle: ~20 documents averaging 2 MB each
- Peak API request rate: ~50 RPS during the burst, ~5 RPS off-peak
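The reference workload can be turned into a rough burst volume. The sketch below works through that arithmetic, assuming each client runs exactly one payroll cycle per month (the list above does not state the cycle frequency). The `Workload` type and `burstVolume` function are illustrative names, not part of the codebase.

```typescript
// Reference workload figures taken from the list above.
interface Workload {
  clients: number;
  docsPerCycle: number;
  avgDocMb: number;
  burstShare: number; // fraction of monthly submissions arriving in the burst
  burstHours: number;
}

interface BurstVolume {
  docsInBurst: number;
  mbInBurst: number;
  docsPerMinute: number;
  mbPerSecond: number;
}

// Assumption: one payroll cycle per client per month.
function burstVolume(w: Workload): BurstVolume {
  const docsPerMonth = w.clients * w.docsPerCycle;
  const docsInBurst = docsPerMonth * w.burstShare;
  const mbInBurst = docsInBurst * w.avgDocMb;
  const burstSeconds = w.burstHours * 3600;
  return {
    docsInBurst,
    mbInBurst,
    docsPerMinute: docsInBurst / (burstSeconds / 60),
    mbPerSecond: mbInBurst / burstSeconds,
  };
}

const reference: Workload = {
  clients: 50,
  docsPerCycle: 20,
  avgDocMb: 2,
  burstShare: 0.8,
  burstHours: 2,
};
const volume = burstVolume(reference);
```

Under that assumption the burst carries about 800 documents (roughly 1.6 GB) over two hours, which averages out to 6 or 7 uploads per minute. Real arrivals are unlikely to be evenly spread, so treat these as averages rather than peaks.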
API sizing (initial)
| Setting | Value | Rationale |
|---|---|---|
| Replicas | 2 | Redundancy + rolling deploys |
| CPU request / limit | 250m / 1000m | Fastify is cheap; burst headroom for OCR coordination |
| Memory request / limit | 256Mi / 768Mi | Prisma client + ExcelJS render footprint |
| Readiness probe | Every 10s | Drives load balancer rotation during restart |
Worker sizing (initial)
| Setting | Value | Rationale |
|---|---|---|
| Replicas | 2 | pg-boss supports multi-worker out of the box |
| CPU request / limit | 500m / 2000m | OCR + xlsx parsing are CPU-bound |
| Memory request / limit | 512Mi / 1.5Gi | Excel workbooks can be large in memory |
| pg-boss teamSize.ocr-process | 4 | Per worker; 8 total concurrent OCR jobs |
| pg-boss teamSize.reminder-email | 2 | Lower priority |
| pg-boss teamSize.outbox-poll | 1 | Singleton; pg-boss dedupes |
| pg-boss teamSize.validation-run | 2 | Per worker; 4 total; validation is fast |
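The "total" figures in the table follow from multiplying the per-worker teamSize by the worker replica count. The sketch below reproduces that calculation so the totals can be rechecked when either value changes. `totalConcurrency` is a hypothetical helper, and it does not call pg-boss.

```typescript
// Per-worker pg-boss teamSize values from the table above.
const teamSizePerWorker: Record<string, number> = {
  "ocr-process": 4,
  "reminder-email": 2,
  "outbox-poll": 1,
  "validation-run": 2,
};

// Cluster-wide concurrency per queue = worker replicas * per-worker teamSize.
function totalConcurrency(
  replicas: number,
  sizes: Record<string, number>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const [queue, size] of Object.entries(sizes)) {
    totals[queue] = replicas * size;
  }
  return totals;
}

// With the initial 2 worker replicas.
const concurrencyTotals = totalConcurrency(2, teamSizePerWorker);
```

With 2 replicas this gives 8 for ocr-process and 4 for validation-run, matching the table. The plain product gives 2 for outbox-poll. The table treats that queue as a singleton and relies on pg-boss deduplication, so its effective concurrency is expected to stay at 1.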
Database (Postgres)
| Setting | Value | Rationale |
|---|---|---|
| Instance class | db.t3.medium (or equivalent) initially | 2 vCPU / 4 GB; upgrade based on CPU/connections metrics |
| max_connections | 100 | API 2 × 10 + worker 2 × 20 + buffer |
| PgBouncer pool (transaction mode) | 50 per app pod | Keeps the raw connection count low |
| Storage | 50 GB with auto-grow | 20% buffer over current |
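The max_connections rationale is a small budget calculation. The sketch below spells it out, reading "2 × 10" and "2 × 20" as replicas times connections per pod (an interpretation of the table, not something it states explicitly). `PodGroup` and `connectionBudget` are illustrative names.

```typescript
interface PodGroup {
  name: string;
  replicas: number;
  connectionsPerPod: number;
}

interface ConnectionBudget {
  allocated: number; // connections claimed by all pods
  buffer: number; // what is left under max_connections
  utilisation: number; // allocated / max_connections
}

// Sum replicas * connectionsPerPod across groups and compare with the limit.
function connectionBudget(maxConnections: number, groups: PodGroup[]): ConnectionBudget {
  const allocated = groups.reduce(
    (sum, g) => sum + g.replicas * g.connectionsPerPod,
    0,
  );
  return {
    allocated,
    buffer: maxConnections - allocated,
    utilisation: allocated / maxConnections,
  };
}

const budget = connectionBudget(100, [
  { name: "api", replicas: 2, connectionsPerPod: 10 },
  { name: "worker", replicas: 2, connectionsPerPod: 20 },
]);
```

With the initial sizing this allocates 60 of the 100 connections and leaves a buffer of 40. Re-run the calculation whenever a replica count changes, since one extra worker pod consumes half of that buffer.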
S3 (file uploads)
No sizing needed; S3 is pay-per-use. Monitor the monthly transfer cost metric for unexpected spikes, which may indicate a data leak or a pentest and are worth noting either way.
Known bottlenecks (to monitor)
- pg-boss job queue depth: if it builds up during the burst, raise teamSize.ocr-process
- Postgres connection saturation: alert at 80% of max_connections
- ExcelJS memory during large output parse: 100+ MB workbooks can OOM the worker; watch the container memory usage during month-end
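The three checks above could be evaluated against a metrics snapshot roughly as follows. This is a sketch under assumptions: the `Snapshot` shape and alert names are hypothetical, "builds up" is read as three or more strictly increasing queue-depth samples, and the 90% memory threshold is a placeholder (only the 80% connection threshold comes from this document).

```typescript
interface Snapshot {
  activeConnections: number;
  maxConnections: number;
  ocrQueueDepth: number[]; // recent samples, oldest first
  workerMemoryMi: number;
  workerMemoryLimitMi: number;
}

// True when there are at least 3 samples and each one exceeds the previous.
function isGrowing(samples: number[]): boolean {
  if (samples.length < 3) return false;
  let prev = -Infinity;
  for (const v of samples) {
    if (v <= prev) return false;
    prev = v;
  }
  return true;
}

function checkBottlenecks(s: Snapshot): string[] {
  const alerts: string[] = [];
  // Documented threshold: alert at 80% of max_connections.
  if (s.activeConnections >= 0.8 * s.maxConnections) {
    alerts.push("postgres-connections");
  }
  // Assumed definition of "builds up": monotonically rising depth.
  if (isGrowing(s.ocrQueueDepth)) {
    alerts.push("ocr-queue-depth");
  }
  // Placeholder threshold: 90% of the worker memory limit.
  if (s.workerMemoryMi >= 0.9 * s.workerMemoryLimitMi) {
    alerts.push("worker-memory");
  }
  return alerts;
}

const monthEndAlerts = checkBottlenecks({
  activeConnections: 85,
  maxConnections: 100,
  ocrQueueDepth: [10, 40, 90],
  workerMemoryMi: 900,
  workerMemoryLimitMi: 1536, // 1.5Gi worker limit
});
```

In this example the connection and queue checks fire while memory stays under the placeholder threshold. In practice these rules would more likely live in the monitoring system than in application code.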
How to re-run the load test
```bash
# 1. Spin up a staging environment with production-like data
#    (terraform apply has no -workspace flag; select the workspace first)
terraform workspace select staging
terraform apply
# 2. Point k6 at it
BASE_URL=https://staging-api.spade \
STAFF_EMAIL=admin@spade.local \
STAFF_PASSWORD="$(get-secret staging-admin-pass)" \
k6 run tests/load/month-end-burst.js
# 3. Compare the results against the SLO thresholds in the test file
# 4. If any threshold fails, update the sizing above BEFORE merging to main
```
When to update this doc
- After every load test run (update the observed numbers)
- When adding significant client load (each new client ≈ 50 more submissions/month)
- After any major schema change (ExcelJS / Prisma footprint shifts)
- When an incident reveals an unknown bottleneck