Skip to content

Runbook: Graceful Restart (API + Worker)

API graceful restart

  1. Kubernetes sends SIGTERM
  2. Readiness probe (/health/ready) starts returning 503 → load balancer removes the pod from rotation
  3. Fastify app.close() waits for in-flight requests to drain (30s timeout)
  4. Prisma + pg-boss clients disconnect
  5. Process exits 0

Kubernetes configuration

yaml
terminationGracePeriodSeconds: 35  # 30s drain + 5s cushion
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]  # gives ALB time to notice readiness flip

Commands

bash
# Rolling restart
kubectl rollout restart deployment/spade-api
kubectl rollout status deployment/spade-api

# Watch readiness during restart
watch 'curl -s https://api.spade/health/ready | jq'

Worker graceful restart

Longer drain than the API — OCR jobs can legitimately take minutes.

  1. SIGTERM → stop polling pg-boss (boss.stop({ graceful: true, timeout: 120_000 }))
  2. Long-running handlers check shutdownSignal.aborted at safe yield points
  3. In-flight jobs complete OR hit timeout → pg-boss retries on the next worker
  4. Prisma disconnect → process exits

Kubernetes configuration

yaml
terminationGracePeriodSeconds: 125  # 120s drain + 5s cushion

Commands

bash
kubectl rollout restart deployment/spade-worker
kubectl rollout status deployment/spade-worker --timeout=3m

Orphaned jobs

Any job killed mid-execution is handled by pg-boss's existing retry mechanism. Handlers are required to be idempotent on retry — if a handler has a side effect that isn't safe to repeat, that's a bug, not a runbook issue.

Watching the restart

bash
# Aggregate logs across pods
kubectl logs -l app=spade-api --since=2m -f
kubectl logs -l app=spade-worker --since=2m -f

Look for:

  • Shutting down worker... / Shutting down API...
  • pg-boss stopped
  • No request errors during the drain window

Failure modes

API pods stuck in Terminating

Most common cause: a handler is blocking on something that won't complete before the 30s timeout (stuck DB query, infinite retry loop). Fix: adjust the timeout or find the handler bug.

bash
# Force kill after grace period
kubectl delete pod <pod-name> --grace-period=0 --force

Worker jobs duplicating

If a worker is force-killed before pg-boss marks the job complete, the same job may run twice. Handlers must be idempotent (keyed on the job payload's natural id). If duplicates are causing real problems, that's a handler bug.

SLO implications

Graceful restarts should not burn error budget. The readiness probe flipping to 503 tells the load balancer to stop sending traffic before requests fail. If you see 5xxs during a restart, the probe isn't flipping fast enough — investigate.

  • apps/api/src/index.ts — SIGTERM handler
  • apps/worker/src/index.ts — SIGTERM handler
  • docs/slo.md § Availability
  • apps/api/src/routes/health.ts — readiness probe implementation

Internal use only — BreezyCorp