Runbook: Graceful Restart (API + Worker)

API graceful restart

Kubernetes sends SIGTERM
Readiness probe (/health/ready) starts returning 503 → load balancer removes the pod from rotation
Fastify app.close() waits for in-flight requests to drain (30s timeout)
Prisma + pg-boss clients disconnect
Process exits 0

Kubernetes configuration

yaml

terminationGracePeriodSeconds: 35  # 30s drain + 5s cushion
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]  # gives ALB time to notice readiness flip

Commands

bash

# Rolling restart
kubectl rollout restart deployment/spade-api
kubectl rollout status deployment/spade-api

# Watch readiness during restart
watch 'curl -s https://api.spade/health/ready | jq'

Worker graceful restart

Longer drain than the API — OCR jobs can legitimately take minutes.

SIGTERM → stop polling pg-boss (boss.stop({ graceful: true, timeout: 120_000 }))
Long-running handlers check shutdownSignal.aborted at safe yield points
In-flight jobs complete OR hit timeout → pg-boss retries on the next worker
Prisma disconnect → process exits

Kubernetes configuration

yaml

terminationGracePeriodSeconds: 125  # 120s drain + 5s cushion

Commands

bash

kubectl rollout restart deployment/spade-worker
kubectl rollout status deployment/spade-worker --timeout=3m

Orphaned jobs

Any job killed mid-execution is handled by pg-boss's existing retry mechanism. Handlers are required to be idempotent on retry — if a handler has a side effect that isn't safe to repeat, that's a bug, not a runbook issue.

Watching the restart

bash

# Aggregate logs across pods
kubectl logs -l app=spade-api --since=2m -f
kubectl logs -l app=spade-worker --since=2m -f

Look for:

Shutting down worker... / Shutting down API...
pg-boss stopped
No request errors during the drain window

Failure modes

API pods stuck in Terminating

Most common cause: a handler is blocking on something that won't complete before the 30s timeout (stuck DB query, infinite retry loop). Fix: adjust the timeout or find the handler bug.

bash

# Force kill after grace period
kubectl delete pod <pod-name> --grace-period=0 --force

Worker jobs duplicating

If a worker is force-killed before pg-boss marks the job complete, the same job may run twice. Handlers must be idempotent (keyed on the job payload's natural id). If duplicates are causing real problems, that's a handler bug.

SLO implications

Graceful restarts should not burn error budget. The readiness probe flipping to 503 tells the load balancer to stop sending traffic before requests fail. If you see 5xxs during a restart, the probe isn't flipping fast enough — investigate.

apps/api/src/index.ts — SIGTERM handler
apps/worker/src/index.ts — SIGTERM handler
docs/slo.md § Availability
apps/api/src/routes/health.ts — readiness probe implementation

Runbook: Graceful Restart (API + Worker) ​

API graceful restart ​

Kubernetes configuration ​

Commands ​

Worker graceful restart ​

Kubernetes configuration ​

Commands ​

Orphaned jobs ​

Watching the restart ​

Failure modes ​

API pods stuck in Terminating ​

Worker jobs duplicating ​

SLO implications ​

Related ​

Runbook: Graceful Restart (API + Worker)

API graceful restart

Kubernetes configuration

Commands

Worker graceful restart

Kubernetes configuration

Commands

Orphaned jobs

Watching the restart

Failure modes

API pods stuck in Terminating

Worker jobs duplicating

SLO implications

Related