ai-document-processor — auditable document extraction pipeline
A format-agnostic ingestion pipeline where every extracted field traces back to its source span. PDF/DOCX/image → OCR → classify → extract → queryable JSONB.
The problem
A regulated-document pipeline has to be format-agnostic (text PDF, scanned PDF, JPEG, DOCX), type-aware (invoice vs. contract vs. receipt), and queryable downstream. It also has to be self-hostable. Hand-rolling a parser per format fails the first requirement, and buying a closed SaaS fails the last.
The solution
A two-tier AI pipeline: a cheap model classifies, then a focused model extracts by type. Every extracted field traces to its source span. PyMuPDF handles text extraction and Tesseract provides the OCR fallback. PostgreSQL stores extracted fields as JSONB alongside a TSVECTOR column for full-text search. Running cost is ~$0.006 per document.
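To make the two-tier flow and the field-to-span tracing concrete, here is a minimal sketch. It is not the project's code: the classifier and extractor are rule-based stand-ins for the two models, and every name in it (TracedField, KEYWORDS, PATTERNS, classify, extract, process) is hypothetical. What it shows is the shape of the contract: classification picks the extractor, and each extracted field carries character offsets that point back into the source text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedField:
    name: str
    value: str
    start: int  # character offset into the source text (inclusive)
    end: int    # character offset into the source text (exclusive)


# Hypothetical stand-in for the cheap classifier: keyword counts, not a model.
KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "receipt": ("receipt", "change due"),
    "contract": ("agreement", "hereinafter"),
}

# Hypothetical stand-in for the per-type extractors: one regex per field.
PATTERNS = {
    "invoice": {
        "vendor": re.compile(r"Vendor:\s*(.+)"),
        "amount": re.compile(r"Amount due:\s*(\$[\d,]+\.\d{2})"),
    },
}


def classify(text: str) -> tuple[str, float]:
    """Tier 1: return (doc_type, confidence) from the share of keyword hits."""
    lowered = text.lower()
    hits = {t: sum(lowered.count(k) for k in kws) for t, kws in KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return "unknown", 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


def extract(text: str, doc_type: str) -> list[TracedField]:
    """Tier 2: run only the extractor keyed to the classified type."""
    fields = []
    for name, pattern in PATTERNS.get(doc_type, {}).items():
        match = pattern.search(text)
        if match:
            # Record where the value came from so it can be audited later.
            fields.append(
                TracedField(name, match.group(1), match.start(1), match.end(1))
            )
    return fields


def process(text: str) -> dict:
    doc_type, confidence = classify(text)
    return {
        "type": doc_type,
        "confidence": confidence,
        "fields": extract(text, doc_type),
    }
```

The audit check is then a single slice: for any field f in the result, text[f.start:f.end] should equal f.value. A model-backed extractor would have to return (or be post-processed to recover) the same offsets for that check to hold.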
- Constraint: Self-hostable, format-agnostic, and type-aware. Extraction has to be queryable downstream and auditable per field.
- Decision: Split the work: a cheap classifier first, then a focused extractor keyed to the document type. A free no-API-key text path means the system degrades instead of failing. Store JSONB + TSVECTOR so downstream queries hit one table.
- Outcome: A live, self-hostable pipeline at ~$0.006/document. Every extracted field traces to its source span and is searchable in full text.
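The "one table" decision can be sketched as follows. The table name, column names, and index names are assumptions for illustration, not the project's actual schema. The sketch uses a generated TSVECTOR column (PostgreSQL 12 or later), and build_search is a hypothetical helper that returns a query plus parameters for any DB-API driver using %s placeholders.

```python
import json

# Assumed schema: names are illustrative, not taken from the project.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS documents (
    id        BIGSERIAL PRIMARY KEY,
    doc_type  TEXT  NOT NULL,
    fields    JSONB NOT NULL,
    raw_text  TEXT  NOT NULL,
    -- Kept in sync by PostgreSQL whenever raw_text changes.
    search    TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', raw_text)) STORED
);
CREATE INDEX IF NOT EXISTS documents_fields_idx ON documents USING GIN (fields);
CREATE INDEX IF NOT EXISTS documents_search_idx ON documents USING GIN (search);
"""

# Structured filter (JSONB containment) and free-text match in one statement.
SEARCH_SQL = """
SELECT id, doc_type, fields
FROM documents
WHERE fields @> %s::jsonb
  AND search @@ plainto_tsquery('english', %s)
"""


def build_search(field_filter: dict, phrase: str) -> tuple[str, tuple[str, str]]:
    """Return (sql, params); values are passed as parameters, never inlined."""
    return SEARCH_SQL, (json.dumps(field_filter, sort_keys=True), phrase)
```

For example, build_search({"vendor": "Acme Tools"}, "overdue payment") yields a query that matches documents whose extracted fields contain that vendor and whose text mentions both words. Because both predicates hit the same row, no join between a "fields" store and a "search" store is needed.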
Overview
Upload PDFs, images, or DOCX files. The pipeline extracts text, falling back to OCR for scanned input. A cheap model classifies the document type with a confidence score. A focused model pulls structured fields (vendor, amount, dates, line items) into clean JSON. Results land in PostgreSQL with JSONB + TSVECTOR full-text search. A Next.js dashboard sits on top, and Docker Compose brings the whole stack up with one command. With no API key the pipeline runs text extraction only, so the demo path is free.
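The text-extraction step with its OCR fallback might look roughly like this for PDFs. This is a sketch under stated assumptions, not the project's implementation: it assumes the PyMuPDF, pytesseract, and Pillow packages plus a local Tesseract install, and the function names and the 20-character threshold for "this page has no usable text layer" are hypothetical choices.

```python
import io


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is empty or nearly empty."""
    return len(page_text.strip()) < min_chars


def extract_pdf_text(path: str, min_chars: int = 20) -> str:
    """Return the text of every page, using OCR only where the text layer fails."""
    # Imported lazily so needs_ocr stays usable without these packages.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text, min_chars):
                # Render the page to a PNG in memory and let Tesseract read it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```

Deciding per page rather than per file means a mixed PDF (typed pages plus a scanned appendix) only pays the OCR cost where it is needed. This step needs no API key, which is what keeps the text-only demo path free.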