Runbook: Graceful Restart (API + Worker)
API graceful restart
- Kubernetes sends SIGTERM
- Readiness probe (
/health/ready) starts returning 503 → load balancer removes the pod from rotation - Fastify
app.close()waits for in-flight requests to drain (30s timeout) - Prisma + pg-boss clients disconnect
- Process exits 0
Kubernetes configuration
terminationGracePeriodSeconds: 35 # 30s drain + 5s cushion
lifecycle:
preStop:
exec:
command: ["sh", "-c", "sleep 5"] # gives ALB time to notice readiness flipCommands
# Rolling restart
kubectl rollout restart deployment/spade-api
kubectl rollout status deployment/spade-api
# Watch readiness during restart
watch 'curl -s https://api.spade/health/ready | jq'Worker graceful restart
Longer drain than the API — OCR jobs can legitimately take minutes.
- SIGTERM → stop polling pg-boss (
boss.stop({ graceful: true, timeout: 120_000 })) - Long-running handlers check
shutdownSignal.abortedat safe yield points - In-flight jobs complete OR hit timeout → pg-boss retries on the next worker
- Prisma disconnect → process exits
Kubernetes configuration
terminationGracePeriodSeconds: 125 # 120s drain + 5s cushionCommands
kubectl rollout restart deployment/spade-worker
kubectl rollout status deployment/spade-worker --timeout=3mOrphaned jobs
Any job killed mid-execution is handled by pg-boss's existing retry mechanism. Handlers are required to be idempotent on retry — if a handler has a side effect that isn't safe to repeat, that's a bug, not a runbook issue.
Watching the restart
# Aggregate logs across pods
kubectl logs -l app=spade-api --since=2m -f
kubectl logs -l app=spade-worker --since=2m -fLook for:
Shutting down worker.../Shutting down API...pg-boss stopped- No request errors during the drain window
Failure modes
API pods stuck in Terminating
Most common cause: a handler is blocking on something that won't complete before the 30s timeout (stuck DB query, infinite retry loop). Fix: adjust the timeout or find the handler bug.
# Force kill after grace period
kubectl delete pod <pod-name> --grace-period=0 --forceWorker jobs duplicating
If a worker is force-killed before pg-boss marks the job complete, the same job may run twice. Handlers must be idempotent (keyed on the job payload's natural id). If duplicates are causing real problems, that's a handler bug.
SLO implications
Graceful restarts should not burn error budget. The readiness probe flipping to 503 tells the load balancer to stop sending traffic before requests fail. If you see 5xxs during a restart, the probe isn't flipping fast enough — investigate.
Related
apps/api/src/index.ts— SIGTERM handlerapps/worker/src/index.ts— SIGTERM handlerdocs/slo.md§ Availabilityapps/api/src/routes/health.ts— readiness probe implementation