Runbook: OCR Pipeline Failing
Alert: spade_ocr_failures / spade_ocr_total > 10% over 15 min
Severity: Warning
Symptoms
- Extraction rows landing with status: 'FAILED' or confidence: 0
- Worker logs showing Document classification FAILED at adapter or Document extraction FAILED at adapter
- Portal clients unable to upload documents (finalize 415s, separate from OCR)
Diagnose
1. Recent failures
SELECT
de.status,
COUNT(*) AS cnt,
MAX(de.id) AS sample
FROM document_extractions de
JOIN files f ON f.id = de.file_id
WHERE f.created_at > NOW() - INTERVAL '15 minutes'
GROUP BY de.status;
- Lots of FAILED → adapter / provider error; check the error line in worker logs
- Lots of COMPLETED with confidence_summary → overall = 0 → Claude ran but returned nulls (doc type mismatched the schema, not a pipeline bug)
- OVERSIZED classifications → legitimate large-file rejections at fetch time
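The reading of the counts above can be sketched as a small triage helper. This is an illustration only: the `StatusCounts` shape, the `triage` function, and the "more than half of all rows" rule are assumptions made for the example, not part of the pipeline.

```typescript
// Hypothetical input: counts gathered from the "Recent failures" query,
// plus how many COMPLETED rows had an overall confidence of 0.
interface StatusCounts {
  failed: number;
  completed: number;
  completedZeroConfidence: number;
  oversized: number;
}

type Triage =
  | "adapter-or-provider-error"
  | "schema-mismatch"
  | "oversized-rejections"
  | "no-dominant-pattern";

// Picks the dominant pattern. The 50% cut-off is an assumed rule of thumb.
function triage(c: StatusCounts): Triage {
  const total = c.failed + c.completed + c.oversized;
  if (total === 0) return "no-dominant-pattern";
  if (c.failed / total > 0.5) return "adapter-or-provider-error";
  if (c.completedZeroConfidence / total > 0.5) return "schema-mismatch";
  if (c.oversized / total > 0.5) return "oversized-rejections";
  return "no-dominant-pattern";
}
```

A result of "adapter-or-provider-error" points at the worker logs; "schema-mismatch" suggests the documents themselves, not the pipeline.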
2. Check S3 reachability
# From the worker pod / host
aws s3 ls s3://breezycorp/ --region $REGION
If this fails → S3 credentials / network issue, not an OCR problem.
3. Check which adapter is active
grep OCR_PROVIDER .env # or check env in pod
- vision+claude → production path: Google Cloud Vision for OCR text + Claude tool-use for structured field extraction (packages/documents/src/adapters/composite.ts)
- mock → CI / credential-less dev fallback; not used in production except as the temporary outage fallback described in D
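A minimal sketch of checking the configured value programmatically, assuming the two values listed above are the only valid ones. `checkOcrProvider` and the `isProduction` flag are hypothetical names for this example; the real adapter selection lives in the worker and may differ.

```typescript
type OcrProvider = "vision+claude" | "mock";

interface ProviderCheck {
  provider: OcrProvider | null; // null when the value is not one the runbook lists
  warning: string | null;
}

// `isProduction` is a hypothetical flag; derive it however your deployment
// distinguishes environments.
function checkOcrProvider(raw: string | undefined, isProduction: boolean): ProviderCheck {
  const value = (raw ?? "").trim();
  if (value === "vision+claude") {
    return { provider: value, warning: null };
  }
  if (value === "mock") {
    return {
      provider: value,
      warning: isProduction
        ? "mock adapter is active in production; expected only during a provider outage"
        : null,
    };
  }
  return { provider: null, warning: `unrecognized OCR_PROVIDER value: "${value}"` };
}
```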
4. Inspect a specific failed extraction
SELECT
de.extractor_key,
de.status,
de.confidence_summary,
jsonb_pretty(de.raw_payload_json)
FROM document_extractions de
WHERE de.status = 'FAILED'
ORDER BY de.id DESC
LIMIT 1;
The raw_payload_json contains the adapter's stage marker: { stage: 'vision', provider: 'gcp-vision' } means Vision blew up before Claude ran; anything with provider: 'anthropic' means Vision succeeded and Claude failed.
The human-readable error is in the worker's stdout — grep for Document classification FAILED at adapter or Document extraction FAILED at adapter near the timestamp.
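The stage-marker rule can be written down as a small helper. This sketch only uses the two provider values quoted above; `failedStage` and the `StageMarker` type are hypothetical names, and any other fields in raw_payload_json are ignored.

```typescript
// Shape taken from the runbook's examples; other payload fields are ignored.
interface StageMarker {
  stage?: string;
  provider?: string;
}

type FailedStage = "vision" | "claude" | "unknown";

// Maps a parsed raw_payload_json value to the stage that failed.
function failedStage(rawPayload: unknown): FailedStage {
  if (typeof rawPayload !== "object" || rawPayload === null) return "unknown";
  const marker = rawPayload as StageMarker;
  if (marker.provider === "gcp-vision") return "vision"; // Vision failed before Claude ran
  if (marker.provider === "anthropic") return "claude"; // Vision succeeded, Claude failed
  return "unknown";
}
```

An "unknown" result means the payload carries no recognizable marker, in which case the worker stdout is the only source for the error.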
Common causes & fixes
A. Google Cloud Vision credential corruption
Error line: DECODER routines::unsupported or bad base64 decode.
Meaning: the GOOGLE_VISION_CREDENTIALS_JSON blob in the worker env has a mangled private key, usually a missing char from a manual copy-paste.
Fix: regenerate the service account key in GCP Console → IAM → Service Accounts → Keys, replace the env var with the fresh JSON (copy-paste the entire downloaded file in one go), restart the worker.
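Before restarting the worker, a shallow sanity check on the pasted blob can catch a dropped character. The sketch below assumes the standard service account key layout, with a `private_key` field holding a PEM block; `checkVisionCredentials` is a hypothetical helper, and passing it does not prove the key is valid, only that it is not obviously truncated.

```typescript
// Returns a list of problems found; an empty list means the blob passed
// these shallow checks (it does NOT verify the key cryptographically).
function checkVisionCredentials(blob: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(blob);
  } catch {
    return ["not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) return ["JSON is not an object"];

  const key = (parsed as { private_key?: unknown }).private_key;
  if (typeof key !== "string") return ["private_key is missing or not a string"];

  const header = "-----BEGIN PRIVATE KEY-----";
  const footer = "-----END PRIVATE KEY-----";
  const trimmed = key.trim();
  if (!trimmed.startsWith(header) || !trimmed.endsWith(footer)) {
    return ["private_key lacks the PEM header or footer"];
  }

  // Strip the PEM armor and all whitespace, leaving only the base64 body.
  const body = trimmed
    .slice(header.length, trimmed.length - footer.length)
    .replace(/\s+/g, "");

  const problems: string[] = [];
  if (!/^[A-Za-z0-9+/]+={0,2}$/.test(body)) {
    problems.push("non-base64 character in key body");
  }
  // Padded base64 always has a length divisible by 4, so one dropped
  // character shows up here.
  if (body.length % 4 !== 0) {
    problems.push("base64 length is not a multiple of 4 (dropped character?)");
  }
  return problems;
}
```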
B. Anthropic API auth failure
Error line: 401 invalid x-api-key or Claude extraction failed.
Fix: rotate ANTHROPIC_API_KEY (console.anthropic.com → API Keys), update worker env, restart.
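The error lines quoted for causes A and B can be matched mechanically when scanning worker logs. This is a sketch using only the substrings given in this runbook; `classifyErrorLine` is a hypothetical helper, and "Claude extraction failed" is treated as cause B purely because it is listed there.

```typescript
type Cause =
  | "A: Vision credential corruption"
  | "B: Anthropic API auth failure"
  | "unclassified";

// Matches a worker log line against the substrings listed for causes A and B.
function classifyErrorLine(line: string): Cause {
  if (
    line.includes("DECODER routines::unsupported") ||
    line.includes("bad base64 decode")
  ) {
    return "A: Vision credential corruption";
  }
  if (
    line.includes("401 invalid x-api-key") ||
    line.includes("Claude extraction failed")
  ) {
    return "B: Anthropic API auth failure";
  }
  // Causes C, D, and E have no quoted error line in this runbook.
  return "unclassified";
}
```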
C. S3 network issue
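Check VPC peering, security groups, IAM. The worker must have s3:GetObject on the bucket. Note that the aws s3 ls check in step 2 needs s3:ListBucket, so an AccessDenied there does not by itself prove the worker cannot read objects. See apps/worker/src/lib/s3.ts for client config.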
D. Provider outage (GCP or Anthropic)
Temporarily flip OCR_PROVIDER=mock in the worker env and restart. OCR rows will still be created (classified as GENERIC_DOCUMENT with a single canned field), so the rest of the pipeline keeps moving — PE will need to manually re-classify or re-run OCR once the provider recovers.
E. Oversized files
packages/documents/src/s3-fetcher.ts hard-caps at 50 MB. If legitimate documents exceed this, update MAX_FILE_SIZE in file-validator.ts and bump the fetcher cap. Don't raise one without the other.
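Why the two caps must move together can be illustrated with a small comparison. This sketch assumes the validator cap applies when a file is accepted and the fetcher cap applies when OCR fetches it from S3; `compareCaps` and the byte values are hypothetical, and only MAX_FILE_SIZE and the 50 MB figure come from this runbook.

```typescript
const MB = 1024 * 1024; // assumption: "50 MB" means 50 * 1024 * 1024 bytes

type CapStatus =
  | "in-sync"
  | "validator-exceeds-fetcher" // accepted files can still be rejected at fetch time
  | "fetcher-exceeds-validator"; // the extra fetcher headroom is never reached

// Compares the validator cap (MAX_FILE_SIZE) with the fetcher's hard cap.
function compareCaps(validatorCapBytes: number, fetcherCapBytes: number): CapStatus {
  if (validatorCapBytes === fetcherCapBytes) return "in-sync";
  return validatorCapBytes > fetcherCapBytes
    ? "validator-exceeds-fetcher"
    : "fetcher-exceeds-validator";
}

// Example: raising only MAX_FILE_SIZE to 100 MB leaves the fetcher at 50 MB.
const afterPartialBump = compareCaps(100 * MB, 50 * MB);
```

Under these assumptions, "validator-exceeds-fetcher" is the harmful case: files between the two caps are accepted and then rejected as oversized during OCR.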
Reclassify after a fix
For files that failed due to a transient issue, re-enqueue:
-- Find the failed files
SELECT f.id, f.storage_key
FROM files f
JOIN document_extractions de ON de.file_id = f.id
WHERE de.status = 'FAILED' AND f.created_at > NOW() - INTERVAL '1 hour';
Then either (a) re-upload the file through the portal so the API enqueues a fresh ocr-process job, or (b) insert a pg-boss job manually:
SELECT pgboss.send('ocr-process', jsonb_build_object('fileId', '<fileId>', 'cycleId', '<cycleId>'));
Escalate if
- Provider-side outage exceeds 30 minutes
- File corruption is detected (hash mismatch between S3 and DB sha256)
- Failure rate climbs above 25% (systemic problem, not transient)
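The escalation criteria can be expressed as a simple check, for example in an on-call helper script. This sketch uses the thresholds stated above; the `IncidentState` shape and both function names are hypothetical, and the inputs are assumed to be gathered by hand or from the spade_ocr_failures and spade_ocr_total metrics.

```typescript
interface IncidentState {
  providerOutageMinutes: number; // 0 when no provider outage is in progress
  hashMismatchDetected: boolean; // S3 object hash differs from the DB sha256
  failureRate: number; // fraction in [0, 1]
}

// Failure ratio as used by the alert; returns 0 when nothing was processed.
function failureRate(failures: number, total: number): number {
  return total > 0 ? failures / total : 0;
}

// Returns every escalation criterion that currently applies.
function escalationReasons(s: IncidentState): string[] {
  const reasons: string[] = [];
  if (s.providerOutageMinutes > 30) reasons.push("provider outage exceeds 30 minutes");
  if (s.hashMismatchDetected) reasons.push("file corruption (S3 vs DB sha256 mismatch)");
  if (s.failureRate > 0.25) reasons.push("failure rate above 25%");
  return reasons;
}
```

An empty result means none of the listed criteria are met and the runbook's fixes still apply.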
Related
- apps/worker/src/handlers/ocr-process.ts
- packages/documents/src/adapters/README.md
- docs/slo.md § OCR pipeline SLO
- docs/runbooks/secret-rotation.md § OCR provider credential rotation