Runbook: Outbox Stuck / Dead-Lettered Events

Alert: spade_outbox_pending > 100 OR spade_outbox_dead_letter_count > 0
Severity: Warning → Critical if dead-letter count rises

Symptoms

  • Clients report missing magic-link emails
  • Approval requests not being sent
  • Outbox table keeps growing

Diagnose

1. Check the pending backlog

sql
SELECT
  event_type,
  COUNT(*) AS pending,
  MIN(published_at) AS oldest
FROM outbox_events
WHERE processed_at IS NULL
GROUP BY event_type
ORDER BY pending DESC;

  • If oldest is more than 5 minutes old and the backlog is still growing → the poller is probably not running
  • If events are accumulating for only one event_type → the provider for that type is probably down
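
The two rules above can be applied mechanically to the query output. The following is a minimal sketch, not part of the existing tooling: the function name `classify_backlog` is hypothetical, and it assumes the query has been extended with a third column holding the age of the oldest pending event in seconds. A single snapshot cannot show whether the backlog is growing, so run it twice a minute or so apart and compare.

```bash
# Sketch only: reads rows of "event_type|pending|oldest_age_seconds" on stdin.
# The third column (age in seconds) is an assumed addition to the query above.
# Possible way to produce the input (assumes psql can reach the database):
#   psql "$DATABASE_URL" -At -F'|' -c "SELECT event_type, COUNT(*),
#     EXTRACT(EPOCH FROM NOW() - MIN(published_at))::int
#     FROM outbox_events WHERE processed_at IS NULL GROUP BY event_type" \
#     | classify_backlog
classify_backlog() {
  local threshold="${1:-300}"   # 300 s mirrors the 5-minute rule
  awk -F'|' -v t="$threshold" '
    NF == 3 {
      total++
      if ($3 + 0 > t) { stale++; names = names (names == "" ? "" : ",") $1 }
    }
    END {
      if (stale == 0)          print "ok"
      else if (stale < total)  print "suspect provider: " names
      else                     print "suspect poller: all " total " pending event types are stale"
    }'
}
```

If only one event type has pending rows at all, the sketch reports the poller; treat that case as ambiguous and check the worker as described in step 3.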

2. Check dead-lettered events

sql
SELECT id, event_type, last_error, attempts, last_error_at
FROM outbox_events
WHERE dead_letter = true
ORDER BY last_error_at DESC
LIMIT 20;

Read last_error to identify the root cause.
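
When many events are dead-lettered, grouping them by message shows whether one cause dominates. A minimal sketch, assuming each last_error fits on one line; the helper name `top_errors` is hypothetical.

```bash
# Sketch only: reads one last_error value per line on stdin and prints
# "<count><TAB><message>", most frequent first.
# Possible way to produce the input (assumes psql can reach the database):
#   psql "$DATABASE_URL" -At \
#     -c "SELECT last_error FROM outbox_events WHERE dead_letter = true" \
#     | top_errors
top_errors() {
  awk 'NF { count[$0]++ }
       END { for (e in count) printf "%d\t%s\n", count[e], e }' |
    sort -t "$(printf '\t')" -k1,1nr -k2
}
```

Multi-line error messages would be counted line by line, so this is only a first-pass triage aid.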

3. Check the worker

bash
# Is the worker running?
kubectl get pods -l app=spade-worker
# Recent logs
kubectl logs -l app=spade-worker --tail=200 | grep outbox

Look for:

  • outbox-poll job execution logs
  • HTTP / SMTP error messages
  • pg-boss connection errors
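
The three signals above can be tallied from a log sample to see which one dominates. This is a rough sketch: the match strings are guesses at how the worker words its log lines, not confirmed formats, and `tally_log_signals` is a hypothetical helper.

```bash
# Sketch only: counts log lines per signal. Patterns are assumptions about
# the worker's log wording and may need adjusting.
# Usage: kubectl logs -l app=spade-worker --tail=200 | tally_log_signals
tally_log_signals() {
  awk '
    { l = tolower($0) }
    l ~ /outbox-poll/                    { poll++ }
    l ~ /smtp|http/ && l ~ /error|fail/  { send++ }
    l ~ /pg-boss/ && l ~ /error|connect/ { boss++ }
    END { printf "outbox-poll=%d send-errors=%d pg-boss-errors=%d\n", poll, send, boss }'
}
```

A count of zero for outbox-poll in a recent sample would be consistent with the poll job not executing at all.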

4. Check SMTP / external provider

bash
# Mailpit (dev) or the prod SMTP health endpoint
curl -sf https://smtp.prod/health

Common causes & fixes

A. SMTP provider down

  • The circuit breaker in packages/notifications/src/adapters/smtp.ts should open after 5 failures and then fail fast instead of attempting delivery
  • Events continue to accrue in the outbox, so no data is lost
  • Action: wait for the provider to recover; the outbox drains on its own once the circuit closes

B. Worker pod crashed / not running

bash
kubectl rollout restart deployment/spade-worker

Then watch the logs until spade_outbox_pending starts trending down.
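
Before watching the backlog, it helps to confirm that the restart actually completed. A minimal sketch using `kubectl rollout status`; the wrapper name `restart_worker` and the 120-second default timeout are assumptions.

```bash
# Sketch only: restart the worker and block until the rollout finishes
# or the timeout expires. Returns non-zero if either step fails.
restart_worker() {
  local timeout="${1:-120s}"   # assumed default; tune to the deployment
  kubectl rollout restart deployment/spade-worker &&
    kubectl rollout status deployment/spade-worker --timeout="$timeout"
}
```

If the status check times out, inspect the pods with `kubectl get pods -l app=spade-worker` before retrying.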

C. Dead-lettered permanently

For an individual event that failed for a fixable reason (e.g., an invalid recipient email), fix the underlying data first, then retry:

bash
# Retry after fixing the underlying data
curl -X POST https://api.spade/ops/outbox/$EVENT_ID/retry \
  -H "Cookie: breezycorp_staff_session=$SESSION"
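
When one data fix unblocks several events, the same call can be looped over a list of IDs. This is a sketch under assumptions: `retry_events` is a hypothetical helper, and the ID character set it accepts (letters, digits, `_`, `-`) is a guess that may not match the real ID format.

```bash
# Sketch only: reads event IDs from stdin, one per line, and POSTs a retry
# for each. Lines that are blank or contain unexpected characters are skipped.
# Usage: printf '%s\n' "$ID_1" "$ID_2" | retry_events
retry_events() {
  : "${SESSION:?set SESSION to a valid staff session cookie value}"
  local id failed=0
  while IFS= read -r id; do
    case "$id" in
      ''|*[!A-Za-z0-9_-]*) continue ;;   # assumed ID format; adjust if needed
    esac
    if curl -sf -o /dev/null -X POST "https://api.spade/ops/outbox/${id}/retry" \
         -H "Cookie: breezycorp_staff_session=${SESSION}" </dev/null; then
      echo "retried ${id}"
    else
      echo "FAILED ${id}" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]   # non-zero exit if any retry request failed
}
```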

For events that should never retry (e.g., client deleted since enqueue):

sql
-- Explicitly mark as processed with a reason
UPDATE outbox_events
SET processed_at = NOW(), last_error = 'manually dismissed: <reason>'
WHERE id = '<event-id>';

Record the manual dismissal in the incident report.
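
To avoid hand-editing quotes in the statement above, the dismissal can be wrapped so that psql does the quoting. A minimal sketch: `dismiss_event` is a hypothetical helper, it assumes psql can connect with DATABASE_URL, and the extra `processed_at IS NULL` guard is an addition not present in the original statement. The SQL is fed on stdin because psql does not interpolate `:'var'` references in `-c` commands.

```bash
# Sketch only: dismiss one event, letting psql quote the values.
# Prints the affected row via RETURNING; no row printed means nothing matched.
dismiss_event() {
  local id="$1" reason="$2"
  [ -n "$id" ] && [ -n "$reason" ] || {
    echo "usage: dismiss_event <event-id> <reason>" >&2
    return 2
  }
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 \
       -v id="$id" -v reason="manually dismissed: $reason" <<'SQL'
UPDATE outbox_events
SET processed_at = NOW(), last_error = :'reason'
WHERE id = :'id' AND processed_at IS NULL  -- assumed guard: skip already-processed rows
RETURNING id, event_type, processed_at;
SQL
}
```

The returned row can be pasted into the incident report as the record of the dismissal.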

D. pg-boss connection pool exhausted

Check DATABASE_URL pool settings. Default max is 10; workers may need 20+ under load. See docs/capacity.md.

Escalate if

  • Dead-lettered count climbs beyond 50 without a clear root cause
  • SMTP provider outage exceeds 1 hour (activate secondary provider per runbooks/secret-rotation.md)
  • Events are older than 1 hour while pg-boss is running normally: the poller likely has a bug, so file a critical incident

Related

  • packages/notifications/src/adapters/smtp.ts: SMTP circuit breaker
  • apps/worker/src/handlers/outbox-poller.ts: poller + retry logic
  • docs/slo.md § Magic-link delivery SLO

Internal use only — BreezyCorp