Capacity Planning

Initial sizing derived from the Phase 3K.4 load test (tests/load/month-end-burst.js). Update this doc after each run of the load test in staging.

Reference workload

50 clients with payroll cycles
Month-end burst: 80% of monthly submissions arrive within a 2-hour window on the 1st of the month
Files per cycle: ~20 documents averaging 2 MB each
Peak concurrent API requests: ~50 RPS during the burst, ~5 RPS off-peak

API sizing (initial)

Setting	Value	Rationale
Replicas	2	Redundancy + rolling deploys
CPU request / limit	250m / 1000m	Fastify is cheap; burst headroom for OCR coordination
Memory request / limit	256Mi / 768Mi	Prisma client + ExcelJS render footprint
Readiness probe	Every 10s	Drives load balancer rotation during restart

Worker sizing (initial)

Setting	Value	Rationale
Replicas	2	pg-boss supports multi-worker out of the box
CPU request / limit	500m / 2000m	OCR + xlsx parsing are CPU-bound
Memory request / limit	512Mi / 1.5Gi	Excel workbooks can be large in memory
pg-boss `teamSize.ocr-process`	4	Per worker — 8 total concurrent OCR jobs
pg-boss `teamSize.reminder-email`	2	Lower priority
pg-boss `teamSize.outbox-poll`	1	Singleton; pg-boss dedupes
pg-boss `teamSize.validation-run`	2	Per worker — 4 total; validation is fast

Database (Postgres)

Setting	Value	Rationale
Instance class	`db.t3.medium` (or equivalent) initially	2 vCPU / 4 GB; upgrade based on CPU/connections metrics
`max_connections`	100	API 2 × 10 + worker 2 × 20 + buffer
PgBouncer pool (transaction mode)	50 per app pod	Keeps the raw connection count low
Storage	50 GB with auto-grow	20% buffer over current

S3 (file uploads)

No sizing — pay-per-use. Monitor the monthly transfer cost metric for unexpected spikes (a leak or a pentest worth noting).

Known bottlenecks (to monitor)

pg-boss job queue depth — if it builds up during the burst, raise teamSize.ocr-process
Postgres connection saturation — alert at 80% of max_connections
ExcelJS memory during large output parse — 100+ MB workbooks can OOM the worker; watch the container memory usage during month-end

How to re-run the load test

bash

# 1. Spin up a staging environment with production-like data
terraform apply -workspace=staging

# 2. Point k6 at it
BASE_URL=https://staging-api.spade \
STAFF_EMAIL=admin@spade.local \
STAFF_PASSWORD="$(get-secret staging-admin-pass)" \
  k6 run tests/load/month-end-burst.js

# 3. Compare the results against the SLO thresholds in the test file
# 4. If any threshold fails, update the sizing above BEFORE merging to main

When to update this doc

After every load test run (update the observed numbers)
When adding significant client load (each new client ≈ 50 more submissions/month)
After any major schema change (ExcelJS / Prisma footprint shifts)
When an incident reveals an unknown bottleneck

Capacity Planning ​

Reference workload ​

API sizing (initial) ​

Worker sizing (initial) ​

Database (Postgres) ​

S3 (file uploads) ​

Known bottlenecks (to monitor) ​

How to re-run the load test ​

When to update this doc ​