Runbook: Outbox Stuck / Dead-Lettered Events
Alert: spade_outbox_pending > 100 OR spade_outbox_dead_letter_count > 0
Severity: Warning → Critical if dead-letter count rises
Symptoms
- Clients report missing magic-link emails
- Approval requests not being sent
- Outbox table keeps growing
Diagnose
1. Check the pending backlog
sql
SELECT
event_type,
COUNT(*) AS pending,
MIN(published_at) AS oldest
FROM outbox_events
WHERE processed_at IS NULL
GROUP BY event_type
ORDER BY pending DESC;
- If oldest is more than 5 minutes old and growing → poller isn't running
- If events are accumulating only for one event_type → provider for that type is down
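The two triage rules above can be sketched as a small function over one snapshot of the query result. This is only an illustration: the `BacklogRow` shape, the diagnosis labels, and the `diagnoseBacklog` name are hypothetical and do not come from the codebase. The "and growing" part of the first rule needs a second snapshot, so it stays with the operator.

```typescript
// Hypothetical row shape mirroring the query's output columns.
interface BacklogRow {
  eventType: string;
  pending: number;
  oldest: Date; // MIN(published_at) for this event type
}

type BacklogDiagnosis = "healthy" | "provider-down" | "poller-not-running";

// "More than 5 minutes old", per the first rule.
const STALE_AFTER_MS = 5 * 60 * 1000;

// Heuristic only: one stale event type suggests its provider is down,
// several stale event types suggest the poller itself is not running.
function diagnoseBacklog(rows: BacklogRow[], now: Date): BacklogDiagnosis {
  const staleTypes = rows.filter(
    (row) => now.getTime() - row.oldest.getTime() > STALE_AFTER_MS
  );
  if (staleTypes.length === 0) return "healthy";
  if (staleTypes.length === 1) return "provider-down";
  return "poller-not-running";
}
```

When only one event type has any pending rows at all, the snapshot cannot tell the two causes apart, so check the worker (step 3) as well.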
2. Check dead-lettered events
sql
SELECT id, event_type, last_error, attempts, last_error_at
FROM outbox_events
WHERE dead_letter = true
ORDER BY last_error_at DESC
LIMIT 20;
Read last_error to identify the root cause.
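When many events are dead-lettered at once, grouping them by last_error usually shows whether there is one root cause or several. A minimal sketch, assuming the query rows have been loaded into objects. The `DeadLetterRow` shape and the `groupByLastError` helper are hypothetical.

```typescript
// Hypothetical shapes; only the columns needed for grouping are included.
interface DeadLetterRow {
  id: string;
  eventType: string;
  lastError: string;
}

interface ErrorGroup {
  lastError: string;
  count: number;
  eventTypes: string[];
}

// Groups dead-lettered rows by identical last_error text, most frequent first.
function groupByLastError(rows: DeadLetterRow[]): ErrorGroup[] {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const byError: { [lastError: string]: ErrorGroup } = Object.create(null);
  for (const row of rows) {
    let group: ErrorGroup | undefined = byError[row.lastError];
    if (group === undefined) {
      group = { lastError: row.lastError, count: 0, eventTypes: [] };
      byError[row.lastError] = group;
    }
    group.count += 1;
    if (group.eventTypes.indexOf(row.eventType) === -1) {
      group.eventTypes.push(row.eventType);
    }
  }
  return Object.keys(byError)
    .map((key) => byError[key])
    .sort((a, b) => b.count - a.count);
}
```

A single dominant group spanning several event types points at shared infrastructure. Many small groups point at per-event data problems (see C below).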
3. Check the worker
bash
# Is the worker running?
kubectl get pods -l app=spade-worker
# Recent logs
kubectl logs -l app=spade-worker --tail=200 | grep outbox
Look for:
- outbox-poll job execution logs
- HTTP / SMTP error messages
- pg-boss connection errors
4. Check SMTP / external provider
bash
# Mailpit (dev) or the prod SMTP health endpoint
curl -sf https://smtp.prod/health
Common causes & fixes
A. SMTP provider down
- Circuit breaker in packages/notifications/src/adapters/smtp.ts should open after 5 failures and start rejecting fast
- Events continue to accrue in the outbox, so no data is lost
- Action: wait for provider restoration; outbox will drain naturally once the circuit closes
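For readers unfamiliar with the pattern, the breaker behaviour described above can be sketched as follows. This is not the code in smtp.ts. The class shape, the 30-second cooldown, and the choice to count consecutive failures are assumptions made for illustration. Only the threshold of 5 comes from this runbook.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Illustrative circuit breaker; not the implementation in smtp.ts.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number = 5, // from the runbook: opens after 5 failures
    cooldownMs: number = 30000, // assumption: the real cooldown is not documented here
    now: () => number = () => Date.now()
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    // After the cooldown a trial request is allowed through ("half-open").
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // While open, callers fail fast instead of waiting on the provider.
  allowRequest(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now(); // open, or re-open after a failed trial request
    }
  }
}
```

In this sketch a sender would call allowRequest() before each SMTP attempt and leave the event unprocessed in the outbox when it returns false, which is why the backlog drains on its own once a trial request succeeds.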
B. Worker pod crashed / not running
bash
kubectl rollout restart deployment/spade-worker
Then watch the logs until spade_outbox_pending starts trending down.
C. Dead-lettered permanently
For an individual event that's legitimately broken (e.g., invalid recipient email):
bash
# Retry after fixing the underlying data
curl -X POST https://api.spade/ops/outbox/$EVENT_ID/retry \
  -H "Cookie: breezycorp_staff_session=$SESSION"
For events that should never retry (e.g., client deleted since enqueue):
sql
-- Explicitly mark as processed with a reason
UPDATE outbox_events
SET processed_at = NOW(), last_error = 'manually dismissed: <reason>'
WHERE id = '<event-id>';
Record the manual dismissal in the incident report.
D. pg-boss connection pool exhausted
Check DATABASE_URL pool settings. Default max is 10; workers may need 20+ under load. See docs/capacity.md.
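As an illustration of where the pool size might be set, the sketch below builds an options object with connectionString and max fields, the names pg-boss's constructor accepts for its connection and pool size. How the worker actually wires these is not shown in this runbook, so the OUTBOX_POOL_MAX variable and the buildPoolOptions helper are hypothetical.

```typescript
// Field names follow pg-boss constructor options; everything else is illustrative.
interface BossPoolOptions {
  connectionString: string;
  max: number;
}

const DEFAULT_POOL_MAX = 10; // default max, as stated above

// Builds pool options from an env-like object. OUTBOX_POOL_MAX is a hypothetical variable.
function buildPoolOptions(env: { [key: string]: string | undefined }): BossPoolOptions {
  const connectionString = env["DATABASE_URL"];
  if (!connectionString) {
    throw new Error("DATABASE_URL is not set");
  }
  const raw = env["OUTBOX_POOL_MAX"];
  const max = raw === undefined ? DEFAULT_POOL_MAX : Number(raw);
  // Reject non-numeric, fractional, and non-positive values rather than guessing.
  if (!isFinite(max) || Math.floor(max) !== max || max < 1) {
    throw new Error("OUTBOX_POOL_MAX must be a positive integer, got: " + raw);
  }
  return { connectionString: connectionString, max: max };
}
```

Before raising max toward 20, check that the total across all worker replicas stays under the database's connection limit.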
Escalate if
- Dead-lettered count climbs beyond 50 without a clear root cause
- SMTP provider outage exceeds 1 hour (activate secondary provider per runbooks/secret-rotation.md)
- Events older than 1 hour with pg-boss running normally: poller has a bug, file a critical incident
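The escalation criteria above can be written as a checklist function, which is handy if they are ever automated in a bot or dashboard. A minimal sketch: the `IncidentSnapshot` shape and the `escalationReasons` name are hypothetical, and the thresholds are copied from the list above.

```typescript
// Hypothetical snapshot of what the on-call engineer has established so far.
interface IncidentSnapshot {
  deadLetterCount: number;
  rootCauseKnown: boolean;
  smtpOutageMinutes: number;
  oldestPendingAgeMinutes: number;
  pgBossHealthy: boolean;
}

// Returns one reason per matching escalation rule; an empty array means no escalation.
function escalationReasons(s: IncidentSnapshot): string[] {
  const reasons: string[] = [];
  if (s.deadLetterCount > 50 && !s.rootCauseKnown) {
    reasons.push("dead-letter count above 50 without a clear root cause");
  }
  if (s.smtpOutageMinutes > 60) {
    reasons.push("SMTP outage longer than 1 hour: activate secondary provider");
  }
  if (s.oldestPendingAgeMinutes > 60 && s.pgBossHealthy) {
    reasons.push("events older than 1 hour with pg-boss healthy: likely poller bug");
  }
  return reasons;
}
```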
Related
- packages/notifications/src/adapters/smtp.ts: SMTP circuit breaker
- apps/worker/src/handlers/outbox-poller.ts: poller + retry logic
- docs/slo.md § Magic-link delivery SLO