Runbook: OCR Pipeline Failing

Alert: spade_ocr_failures / spade_ocr_total > 10% over 15 min
Severity: Warning

Symptoms

  • Extraction rows landing with status: 'FAILED' or confidence: 0
  • Worker logs showing Document classification FAILED at adapter or Document extraction FAILED at adapter
  • Portal clients unable to upload documents (finalize returns 415; this is a separate problem from OCR)

Diagnose

1. Recent failures

sql
SELECT
  de.status,
  COUNT(*) AS cnt,
  MAX(de.id) AS sample
FROM document_extractions de
JOIN files f ON f.id = de.file_id
WHERE f.created_at > NOW() - INTERVAL '15 minutes'
GROUP BY de.status;
  • Lots of FAILED → adapter / provider error; check the error line in worker logs
  • Lots of COMPLETED with overall = 0 in confidence_summary → Claude ran but returned nulls (the document type did not match the schema; this is not a pipeline bug)
  • OVERSIZED classifications → legitimate large-file rejections at fetch time
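
The reading of the status counts above can be sketched as a small lookup. This is a hypothetical helper, not code from the repository: the function name, the row shape, and the assumption that OVERSIZED appears as a value of de.status are all illustrative.

```typescript
// Hypothetical triage helper; names and shapes are illustrative only.
// Note: node-postgres returns COUNT(*) as a string, so convert before calling.
type StatusCount = { status: string; cnt: number };
type StatusTriage = { dominant: string | null; hint: string };

// Hints are taken from the bullets in this runbook.
const STATUS_HINTS: Record<string, string> = {
  FAILED: "Adapter or provider error: check the error line in worker logs.",
  COMPLETED:
    "If overall confidence is 0, Claude ran but returned nulls (doc type mismatched the schema).",
  OVERSIZED: "Legitimate large-file rejections at fetch time.",
};

function triageStatusCounts(rows: StatusCount[]): StatusTriage {
  if (rows.length === 0) {
    return { dominant: null, hint: "No extraction rows in the window." };
  }
  // Pick the status with the highest count; the first row wins ties.
  const top = rows.reduce((best, row) => (row.cnt > best.cnt ? row : best));
  const hint = STATUS_HINTS[top.status] ?? "No guidance in this runbook for that status.";
  return { dominant: top.status, hint };
}
```

Feeding it the rows from the query (with cnt converted to a number) gives the dominant status and the matching next step.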

2. Check S3 reachability

bash
# From the worker pod / host
aws s3 ls s3://breezycorp/ --region "$REGION"

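If this fails → S3 credentials / network issue, not an OCR problem. Listing requires s3:ListBucket, which the worker role may not have, so treat an AccessDenied here as inconclusive rather than as proof of a network problem.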

3. Check which adapter is active

bash
grep OCR_PROVIDER .env  # or check env in pod
  • vision+claude → production path: Google Cloud Vision for OCR text + Claude tool-use for structured field extraction (packages/documents/src/adapters/composite.ts)
  • mock → CI / credential-less dev fallback; not used in production except as the temporary workaround for a provider outage (cause D below)
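
A minimal sketch of how the OCR_PROVIDER value could be interpreted during triage. Only the two values vision+claude and mock come from this runbook; the function name and the returned strings are hypothetical, and the real worker may treat unset or unknown values differently.

```typescript
// Hypothetical helper: describes the OCR_PROVIDER value read from the env.
// Only "vision+claude" and "mock" are documented in this runbook.
function describeOcrProvider(raw: string | undefined): string {
  const value = (raw ?? "").trim();
  switch (value) {
    case "vision+claude":
      return "production path: Google Cloud Vision OCR + Claude field extraction";
    case "mock":
      return "mock adapter: CI / dev fallback, or the temporary outage workaround";
    case "":
      return "OCR_PROVIDER is unset";
    default:
      return `unrecognized OCR_PROVIDER value: ${value}`;
  }
}
```

Seeing the mock description on a production worker outside a declared provider outage is worth flagging.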

4. Inspect a specific failed extraction

sql
SELECT
  de.extractor_key,
  de.status,
  de.confidence_summary,
  jsonb_pretty(de.raw_payload_json)
FROM document_extractions de
WHERE de.status = 'FAILED'
ORDER BY de.id DESC
LIMIT 1;

The raw_payload_json contains the adapter's stage marker. { stage: 'vision', provider: 'gcp-vision' } means Vision failed before Claude ran; anything with provider: 'anthropic' means Vision succeeded and Claude failed.

The human-readable error is in the worker's stdout — grep for Document classification FAILED at adapter or Document extraction FAILED at adapter near the timestamp.
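
The stage-marker rule can be written down as a short function, which may help when scanning many failed rows at once. This is a sketch under assumptions: the function and type names are hypothetical, and it relies only on the provider values gcp-vision and anthropic mentioned in this step.

```typescript
// Hypothetical helper: maps a raw_payload_json value to the stage that failed.
type FailedStage = "vision" | "claude" | "unknown";

function failedStageFromPayload(raw: unknown): FailedStage {
  let payload: unknown = raw;
  // Accept either an already-parsed object or a JSON string.
  if (typeof raw === "string") {
    try {
      payload = JSON.parse(raw);
    } catch {
      return "unknown";
    }
  }
  if (typeof payload !== "object" || payload === null) return "unknown";
  const provider = (payload as Record<string, unknown>)["provider"];
  if (provider === "gcp-vision") return "vision"; // Vision failed before Claude ran
  if (provider === "anthropic") return "claude"; // Vision succeeded, Claude failed
  return "unknown";
}
```

A result of vision points toward cause A or D below; claude points toward cause B or D.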

Common causes & fixes

A. Google Cloud Vision credential corruption

Error line: DECODER routines::unsupported or bad base64 decode.
Meaning: the GOOGLE_VISION_CREDENTIALS_JSON blob in the worker env has a mangled private key, usually a character dropped during a manual copy-paste.
Fix: regenerate the service account key in GCP Console → IAM → Service Accounts → Keys, replace the env var with the fresh JSON (copy-paste the entire downloaded file in one go), and restart the worker.
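
Before restarting the worker, a quick structural check of the pasted blob can catch a dropped character. The sketch below is a heuristic, not a cryptographic validation: it assumes the blob is a standard GCP service account JSON with a PKCS#8 private_key field, and the function name is hypothetical. It will not detect a substituted character.

```typescript
// Hypothetical pre-flight check for GOOGLE_VISION_CREDENTIALS_JSON.
// Heuristic only: verifies JSON shape, PEM framing, and base64 length.
type CredentialCheck = { ok: boolean; problem?: string };

function checkVisionCredentials(blob: string): CredentialCheck {
  let parsed: unknown;
  try {
    parsed = JSON.parse(blob);
  } catch {
    return { ok: false, problem: "blob is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, problem: "blob is not a JSON object" };
  }
  const key = (parsed as Record<string, unknown>)["private_key"];
  if (typeof key !== "string") {
    return { ok: false, problem: "private_key is missing" };
  }
  const pem = /^-----BEGIN PRIVATE KEY-----([\s\S]*)-----END PRIVATE KEY-----$/;
  const match = key.trim().match(pem);
  if (!match) {
    return { ok: false, problem: "PEM header or footer is damaged" };
  }
  // Strip line breaks, then check the base64 alphabet and padding length.
  const body = (match[1] ?? "").replace(/\s+/g, "");
  if (!/^[A-Za-z0-9+\/]+={0,2}$/.test(body)) {
    return { ok: false, problem: "PEM body contains non-base64 characters" };
  }
  if (body.length % 4 !== 0) {
    return { ok: false, problem: "PEM body length is not a multiple of 4 (likely a dropped character)" };
  }
  return { ok: true };
}
```

Padded base64 always has a length divisible by 4, so a single missing character shows up in the last check.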

B. Anthropic API auth failure

Error line: 401 invalid x-api-key or Claude extraction failed.
Fix: rotate ANTHROPIC_API_KEY (console.anthropic.com → API Keys), update worker env, restart.

C. S3 network issue

Check VPC peering, security groups, IAM. The worker must have GetObject on the bucket. See apps/worker/src/lib/s3.ts for client config.

D. Provider outage (GCP or Anthropic)

Temporarily set OCR_PROVIDER=mock in the worker env and restart. OCR rows will still be created (classified as GENERIC_DOCUMENT with a single canned field), so the rest of the pipeline keeps moving. PE will need to manually re-classify or re-run OCR once the provider recovers.

E. Oversized files

packages/documents/src/s3-fetcher.ts hard-caps at 50 MB. If legitimate documents exceed this, update MAX_FILE_SIZE in file-validator.ts and bump the fetcher cap. Don't raise one without the other.
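
The "don't raise one without the other" rule can be expressed as a consistency check. This is a sketch under assumptions: it treats the two limits as values that should match, it assumes 50 MB means 50 × 1024 × 1024 bytes, and the function name is hypothetical rather than something that exists in file-validator.ts or s3-fetcher.ts.

```typescript
// Hypothetical consistency check between the upload validator limit and the
// fetcher cap. Returns null when the limits agree, otherwise a warning string.
const MB = 1024 * 1024; // assumption: binary megabytes

function checkSizeCaps(validatorMaxBytes: number, fetcherCapBytes: number): string | null {
  if (validatorMaxBytes > fetcherCapBytes) {
    // Files between the two limits pass validation but are rejected at fetch.
    return `validator allows ${validatorMaxBytes / MB} MB but fetch rejects above ${fetcherCapBytes / MB} MB`;
  }
  if (validatorMaxBytes < fetcherCapBytes) {
    // Harmless for OCR, but the limits have drifted apart.
    return `validator limit ${validatorMaxBytes / MB} MB is below the fetcher cap ${fetcherCapBytes / MB} MB`;
  }
  return null;
}
```

For example, raising only the validator to 100 MB while the fetcher stays at 50 MB yields the first warning, which is the mismatch this section warns against.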

Reclassify after a fix

For files that failed due to a transient issue, first find them:

sql
-- Find the failed files
SELECT f.id, f.storage_key
FROM files f
JOIN document_extractions de ON de.file_id = f.id
WHERE de.status = 'FAILED' AND f.created_at > NOW() - INTERVAL '1 hour';

Then either (a) re-upload the file through the portal so the API enqueues a fresh ocr-process job, or (b) enqueue a pg-boss job manually:

sql
SELECT pgboss.send('ocr-process', jsonb_build_object('fileId', '<fileId>', 'cycleId', '<cycleId>'));
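
When re-enqueueing in bulk, note that the join in the lookup query can return the same file more than once if it has several FAILED extraction rows. A minimal sketch of building one payload per file follows. The row shape is an assumption (the lookup query does not select a cycle id, so it would have to come from elsewhere), and the function name is hypothetical; only the fileId and cycleId payload keys come from the statement above.

```typescript
// Hypothetical helper: builds one ocr-process payload per distinct file.
// Assumes each row already carries the cycleId for its file.
type FailedFileRow = { fileId: string; cycleId: string };
type OcrJobPayload = { fileId: string; cycleId: string };

function buildRequeuePayloads(rows: FailedFileRow[]): OcrJobPayload[] {
  const seen = new Set<string>();
  const payloads: OcrJobPayload[] = [];
  for (const row of rows) {
    // Skip files already queued so each file gets exactly one job.
    if (seen.has(row.fileId)) continue;
    seen.add(row.fileId);
    payloads.push({ fileId: row.fileId, cycleId: row.cycleId });
  }
  return payloads;
}
```

Each resulting payload corresponds to one pgboss.send call for the ocr-process queue.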

Escalate if

  • Provider-side outage exceeds 30 minutes
  • File corruption is detected (hash mismatch between S3 and DB sha256)
  • Failure rate climbs above 25% (systemic problem, not transient)
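
The alert threshold (10%) and the escalation threshold (25%) can be combined into one classification, sketched below. The function name and return labels are hypothetical; the inputs correspond to spade_ocr_failures and spade_ocr_total over the alert window.

```typescript
// Hypothetical helper: classifies the OCR failure rate against the two
// thresholds in this runbook (alert above 10%, escalate above 25%).
type FailureLevel = "ok" | "warning" | "escalate";

function classifyFailureRate(failures: number, total: number): FailureLevel {
  if (total <= 0) return "ok"; // no traffic in the window, nothing to classify
  const rate = failures / total;
  if (rate > 0.25) return "escalate";
  if (rate > 0.1) return "warning";
  return "ok";
}
```

Both comparisons are strict, matching the "> 10%" wording of the alert, so a rate of exactly 10% or 25% stays in the lower band.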

Related

  • apps/worker/src/handlers/ocr-process.ts
  • packages/documents/src/adapters/README.md
  • docs/slo.md § OCR pipeline SLO
  • docs/runbooks/secret-rotation.md § OCR provider credential rotation

Internal use only — BreezyCorp