Enterprises are drowning in documents. Financial statements arrive as PDFs. Contracts exist in scanned images. Invoices, forms, and reports accumulate in formats that resist automation. According to IDC, unstructured data accounts for 80-90% of all enterprise data, yet most of it remains locked away from analytical systems and business processes.
Traditional document processing relied on rigid templates and rule-based OCR. These approaches break the moment a vendor changes their invoice format or a new contract type arrives. Large language models with vision capabilities have fundamentally changed what's possible: systems that understand document structure, extract meaning from context, and adapt to new formats without reprogramming.
TL;DR: Document intelligence combines vision-capable LLMs with traditional extraction techniques to convert unstructured documents into structured data. Modern approaches can handle complex layouts, tables, handwriting, and multi-page documents — with accuracy rates varying by document type and complexity (typically 85-95% for well-structured documents, lower for handwritten or degraded scans). The key is building validation into your pipeline from the start, using Pydantic schemas to enforce structure and catch extraction errors before they propagate downstream.
The Document Intelligence Challenge
Document processing isn't a new problem. What's changed is the sophistication of available tools and the expectations for accuracy and flexibility.
Why Traditional OCR Falls Short
Optical Character Recognition (OCR) converts images to text, but text alone doesn't capture document meaning. A bank statement is more than characters on a page: it has accounts, transactions, dates, and balances arranged in a specific logical structure.
Traditional approaches fail for several reasons:
- Layout blindness — OCR produces a stream of text without understanding columns, tables, or sections
- Template dependency — Rule-based extractors break when document formats change
- Context ignorance — Systems can't infer meaning from surrounding content
- No semantic understanding — "Total" near a number doesn't automatically link them
What LLMs Bring to Document Processing
Vision-capable LLMs (GPT-4 Vision, Claude, Gemini) can look at a document image and understand it much like a human would. They recognize:
- Document structure — Headers, sections, paragraphs, lists, tables
- Semantic relationships — Which values belong to which labels
- Context from content — Inferring document type and purpose from the text itself
- Flexibility across formats — Same model handles invoices, contracts, and forms
This doesn't mean traditional techniques are obsolete. The best pipelines combine specialized extraction tools with LLM understanding. For an overview of how LLMs work with external data sources, see our RAG guide.
Architecture Approaches
There's no single correct architecture for document intelligence. The right choice depends on your document types, accuracy requirements, and infrastructure constraints.
| Approach | How It Works | Best For | Limitations |
|---|---|---|---|
| Vision-First | Send document images directly to vision LLM | Complex layouts, mixed content, rapid prototyping | Higher cost per page, token limits on large documents |
| Text Extraction + LLM | OCR first, then LLM structures the text | Text-heavy documents, cost optimization | Loses layout information, struggles with tables |
| Hybrid Pipeline | Specialized tools for tables/layout, LLM for semantics | Production systems, high accuracy requirements | More complex to build and maintain |
Vision-First Approach
The simplest architecture sends document images directly to a vision-capable LLM with instructions for what to extract. This works remarkably well for many use cases and requires minimal preprocessing.
Advantages:
- Handles any document layout without configuration
- Understands visual context (stamps, signatures, logos)
- Fast to implement and iterate
Disadvantages:
- Cost scales with page count
- Large documents may exceed token limits
- Less precise for dense tabular data
Text Extraction + LLM
Extract text first using OCR or PDF text extraction, then use an LLM to structure and interpret the content. This approach lowers cost per document but discards spatial relationships such as column and table layout.
Hybrid Pipeline
Production systems often combine approaches: specialized table extraction for financial data, layout analysis for document segmentation, and LLMs for semantic understanding and validation. This provides the best accuracy but requires more engineering investment.
Building an Extraction Pipeline
A production document intelligence pipeline typically includes these stages: ingestion, layout analysis, content extraction, LLM structuring, and validation. Let's walk through each with code examples.
Document Ingestion
The first step is loading documents and converting them to a processable format. For PDFs, this might mean extracting text, rendering pages as images, or both.
```python
import fitz  # PyMuPDF
from pathlib import Path
from dataclasses import dataclass


@dataclass
class DocumentPage:
    page_number: int
    text: str
    image_path: Path | None = None


def load_pdf(pdf_path: str, render_images: bool = True) -> list[DocumentPage]:
    """Load PDF, extract text, and optionally render page images."""
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(doc):
        # Extract text with layout preservation
        text = page.get_text("text")

        image_path = None
        if render_images:
            # Render page as image for vision models
            pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom for quality
            image_path = Path(f"/tmp/page_{page_num}.png")
            pix.save(str(image_path))

        pages.append(DocumentPage(
            page_number=page_num + 1,
            text=text,
            image_path=image_path,
        ))
    doc.close()
    return pages
```
Layout Analysis
Understanding document structure before extraction improves accuracy. Layout analysis identifies headers, paragraphs, tables, and other structural elements.
For complex documents, consider using specialized layout models like LayoutLM or document AI services that return bounding boxes and element classifications. For simpler cases, heuristics based on text positioning often suffice.
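To illustrate the heuristic route, here is a minimal sketch that labels lines of extracted text as headings, list items, or body text. The patterns and thresholds are illustrative assumptions, not tuned values; a production system would use positional and font information rather than text alone.

```python
import re


def classify_line(line: str) -> str:
    """Naive layout heuristic for a line of extracted text.
    Thresholds are illustrative assumptions, not tuned values."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    # Bullet or numbered list markers
    if re.match(r"^[-*•]\s+", stripped) or re.match(r"^\d+[.)]\s+", stripped):
        return "list_item"
    # Short lines in title case or all caps often mark section headings
    if len(stripped) < 60 and not stripped.endswith((".", ",", ";", ":")):
        words = stripped.split()
        if stripped.isupper() or all(w[0].isupper() for w in words if w[0].isalpha()):
            return "heading"
    return "body"


def segment_page(text: str) -> list[tuple[str, str]]:
    """Label each non-blank line of extracted page text."""
    return [(classify_line(ln), ln.strip())
            for ln in text.splitlines() if ln.strip()]
```

Heuristics like this are cheap and transparent, which makes them a reasonable first pass before reaching for a layout model.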
Table Extraction
Tables are notoriously difficult for general-purpose extraction. Purpose-built tools significantly outperform both OCR and vision LLMs for dense tabular data.
```python
import pdfplumber
from typing import Any


def extract_tables(pdf_path: str) -> list[dict[str, Any]]:
    """Extract tables from PDF using pdfplumber."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for table_idx, table in enumerate(page_tables):
                if not table or len(table) < 2:
                    continue

                # First row as headers, rest as data
                headers = [str(h).strip() if h else f"col_{i}"
                           for i, h in enumerate(table[0])]
                rows = []
                for row in table[1:]:
                    row_dict = {}
                    for i, cell in enumerate(row):
                        header = headers[i] if i < len(headers) else f"col_{i}"
                        row_dict[header] = str(cell).strip() if cell else ""
                    rows.append(row_dict)

                tables.append({
                    "page": page_num + 1,
                    "table_index": table_idx,
                    "headers": headers,
                    "rows": rows,
                })
    return tables
```
LLM-Powered Structuring
The LLM's role is to interpret extracted content and produce structured output. Using structured output modes (JSON mode or function calling) ensures consistent, parseable results.
```python
import anthropic
import base64
import json
from pathlib import Path


def extract_with_vision(
    image_path: Path,
    extraction_prompt: str,
    schema_description: str,
) -> dict:
    """Extract structured data from document image using Claude."""
    client = anthropic.Anthropic()

    # Load and encode image
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    # Determine media type
    suffix = image_path.suffix.lower()
    media_type = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
    }.get(suffix, "image/png")

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": f"""{extraction_prompt}

Return the extracted data as JSON matching this schema:
{schema_description}

Respond with only valid JSON, no additional text.""",
                    },
                ],
            }
        ],
    )

    return json.loads(response.content[0].text)
```
Validation with Pydantic
Validation is critical. LLMs can hallucinate fields, misformat dates, or return unexpected structures. Pydantic schemas enforce data contracts and provide clear error messages when extraction fails.
```python
from pydantic import BaseModel, Field, field_validator
from datetime import date
from decimal import Decimal


class Transaction(BaseModel):
    date: date
    description: str
    amount: Decimal
    transaction_type: str = Field(pattern="^(credit|debit)$")

    @field_validator("amount", mode="before")
    @classmethod
    def parse_amount(cls, v):
        """Handle various amount formats."""
        if isinstance(v, str):
            # Remove currency symbols and commas
            cleaned = v.replace("$", "").replace(",", "").replace(" ", "")
            return Decimal(cleaned)
        return v


class BankStatement(BaseModel):
    account_number: str
    statement_period_start: date
    statement_period_end: date
    opening_balance: Decimal
    closing_balance: Decimal
    transactions: list[Transaction]

    @field_validator("account_number")
    @classmethod
    def validate_account(cls, v):
        """Ensure account number format."""
        cleaned = v.replace(" ", "").replace("-", "")
        if not cleaned.isdigit() or len(cleaned) < 8:
            raise ValueError("Invalid account number format")
        return cleaned


def validate_extraction(raw_data: dict, schema: type[BaseModel]) -> BaseModel:
    """Validate extracted data against Pydantic schema."""
    try:
        return schema.model_validate(raw_data)
    except Exception as e:
        # Log validation error for review
        print(f"Validation failed: {e}")
        raise
```
Complete Pipeline Example
Here's how these components fit together in a complete extraction pipeline:
```python
from pathlib import Path
from dataclasses import dataclass
from typing import TypeVar, Generic

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


@dataclass
class ExtractionResult(Generic[T]):
    success: bool
    data: T | None
    errors: list[str]
    confidence: float
    raw_response: dict


class DocumentExtractor:
    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        self.model = model

    def extract(
        self,
        pdf_path: str,
        schema: type[T],
        extraction_prompt: str,
    ) -> ExtractionResult[T]:
        """Full extraction pipeline with validation."""
        errors: list[str] = []

        # 1. Load document
        pages = load_pdf(pdf_path, render_images=True)

        # 2. Extract tables (for documents with tabular data)
        tables = extract_tables(pdf_path)

        # 3. Build text context (useful as a fallback or for multi-page prompts)
        context = self._build_context(pages, tables)

        # 4. Extract with vision (using first page image)
        if not pages or pages[0].image_path is None:
            return ExtractionResult(
                success=False,
                data=None,
                errors=["No page image available for vision extraction"],
                confidence=0.0,
                raw_response={},
            )
        try:
            raw_data = extract_with_vision(
                pages[0].image_path,
                extraction_prompt,
                schema.model_json_schema(),
            )
        except Exception as e:
            errors.append(f"Extraction failed: {e}")
            return ExtractionResult(
                success=False,
                data=None,
                errors=errors,
                confidence=0.0,
                raw_response={},
            )

        # 5. Validate against schema
        try:
            validated = schema.model_validate(raw_data)
            return ExtractionResult(
                success=True,
                data=validated,
                errors=[],
                confidence=0.95,  # placeholder; replace with a real confidence signal
                raw_response=raw_data,
            )
        except Exception as e:
            errors.append(f"Validation failed: {e}")
            return ExtractionResult(
                success=False,
                data=None,
                errors=errors,
                confidence=0.0,
                raw_response=raw_data,
            )

    def _build_context(self, pages, tables) -> str:
        """Combine page text and table data into context."""
        context_parts = []
        for page in pages:
            context_parts.append(f"--- Page {page.page_number} ---\n{page.text}")
        if tables:
            context_parts.append("\n--- Extracted Tables ---")
            for table in tables:
                context_parts.append(f"Table on page {table['page']}:")
                context_parts.append(str(table["rows"][:5]))  # First 5 rows
        return "\n\n".join(context_parts)
```
Handling Complex Document Types
Different document categories require different extraction strategies. Here's guidance for common enterprise document types.
Financial Statements
Bank statements, invoices, and financial reports share common challenges: dense tables, precise numeric requirements, and strict validation needs.
- Use specialized table extraction — pdfplumber or Camelot for transaction tables
- Validate totals — Sum extracted transactions and compare to stated totals
- Handle multi-page tables — Track headers across page breaks
- Currency parsing — Normalize formats (1,234.56 vs 1.234,56)
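The currency-parsing point above can be sketched as a small normalizer. This is a sketch under a simplifying assumption: amounts have at most one decimal separator, and the rightmost of `.` or `,` is treated as the decimal mark.

```python
from decimal import Decimal


def normalize_amount(raw: str) -> Decimal:
    """Normalize common currency formats to Decimal.
    Handles '1,234.56' (US), '1.234,56' (European), and accounting
    negatives like '(250.00)'. Sketch only: assumes the rightmost of
    '.' or ',' is the decimal separator."""
    s = raw.strip().replace("$", "").replace("€", "").replace("CHF", "").strip()
    negative = s.startswith("(") and s.endswith(")")
    s = s.strip("()")

    last_dot, last_comma = s.rfind("."), s.rfind(",")
    if last_comma > last_dot:
        # European style: '.' groups thousands, ',' is the decimal mark
        s = s.replace(".", "").replace(",", ".")
    else:
        # US style: ',' groups thousands
        s = s.replace(",", "")

    value = Decimal(s)
    return -value if negative else value
```

A normalizer like this belongs in a `mode="before"` Pydantic validator so every extracted amount passes through it once.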
Contracts and Legal Documents
Legal documents prioritize text extraction accuracy and section identification over tabular data.
- Section detection — Identify clauses, parties, dates, and key terms
- Cross-reference handling — "As defined in Section 3.2"
- Signature detection — Identify signed vs unsigned documents
- Amendment tracking — Link amendments to original agreements
Forms and Applications
Structured forms with fields and checkboxes benefit from layout understanding.
- Field-value pairing — Match labels to their filled values
- Checkbox detection — Identify checked vs unchecked options
- Handwriting recognition — Vision models handle handwritten entries
- Required field validation — Ensure mandatory fields are populated
Production Considerations
Moving document intelligence from prototype to production requires attention to accuracy measurement, cost management, and data security.
Accuracy Measurement
You can't improve what you don't measure. Build evaluation into your pipeline from day one.
- Ground truth datasets — Manually label a representative sample of documents
- Field-level accuracy — Track accuracy per field, not just per document
- Confidence scoring — Flag low-confidence extractions for human review
- Error categorization — Distinguish OCR errors, schema mismatches, and hallucinations
Target accuracy depends on downstream use. Automated processing might require 99%+ accuracy, while human-in-the-loop workflows can tolerate lower thresholds with review queues.
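Field-level accuracy is straightforward to compute once you have labeled ground truth. A minimal sketch, using exact-match comparison (real pipelines usually normalize dates and whitespace before comparing):

```python
def field_accuracy(
    predictions: list[dict], ground_truth: list[dict]
) -> dict[str, float]:
    """Per-field accuracy across a labeled evaluation set.
    Exact-match comparison; normalize values first in practice."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            if pred.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    return {f: correct.get(f, 0) / total[f] for f in total}
```

Per-field scores reveal which fields need prompt or schema work, which a single document-level number would hide.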
Cost Optimization
Document processing costs can escalate quickly at scale. Optimize through:
- Tiered processing — Use cheaper models for simple documents, expensive models for complex ones
- Caching — Cache extractions for documents you've seen before
- Batching — Process multiple pages in single API calls where possible
- Resolution tuning — Lower image resolution for text-heavy documents
Typical costs range from $0.01-0.10 per page depending on complexity and model choice.
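Tiered processing can start as a simple router in front of your extraction calls. The tier names and thresholds below are illustrative assumptions, not recommendations:

```python
def route_document(page_count: int, has_tables: bool, is_scanned: bool) -> str:
    """Route a document to a processing tier. Sketch only: tier names
    and thresholds are illustrative assumptions."""
    if is_scanned or has_tables:
        return "vision-model"      # needs layout understanding
    if page_count <= 5:
        return "small-text-model"  # short, text-only: cheapest tier
    return "large-text-model"      # long but text-only
```

Even a crude router like this can cut costs substantially when most of your volume is simple documents.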
Security and Compliance
Documents often contain sensitive information. Address security early:
- Data residency — Ensure processing happens in appropriate jurisdictions
- PII detection — Identify and handle personal information appropriately
- Audit logging — Track what was extracted from which documents
- Retention policies — Define how long extracted data is stored
For GDPR and similar regulations, consider on-premises processing or providers with appropriate certifications.
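As a starting point for PII handling, a regex pass can redact obvious identifiers before extracted text is stored or logged. The patterns below are illustrative only; production PII detection needs much broader coverage (names, addresses) and typically a dedicated NER model:

```python
import re

# Illustrative patterns only -- not exhaustive PII coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]{4}){3,7}\b"),
}


def redact_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders before storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```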
Tools and Services Landscape
The document intelligence market includes cloud services, open-source tools, and specialized platforms.
| Category | Options | Strengths | Best For |
|---|---|---|---|
| Cloud AI Services | Azure Document Intelligence, AWS Textract, Google Document AI | Pre-built models, managed infrastructure, high accuracy | Standard document types, quick deployment |
| Vision LLMs | GPT-4 Vision, Claude, Gemini | Flexible, handles any layout, semantic understanding | Complex/varied documents, custom extraction |
| Open Source | PyMuPDF, pdfplumber, Camelot, Tesseract, LayoutLM | No API costs, full control, on-premises deployment | Cost-sensitive, data sovereignty requirements |
| IDP Platforms | UiPath, ABBYY, Kofax, Hyperscience | End-to-end workflows, enterprise features | High-volume enterprise processing |
Cloud Service Comparison
| Service | Pre-built Models | Custom Training | Table Extraction | Pricing Model |
|---|---|---|---|---|
| Azure Document Intelligence | Invoices, receipts, IDs, contracts | Yes (custom models) | Excellent | Per page ($1.50-15/1000 pages) |
| AWS Textract | Forms, tables, invoices, IDs | Limited (queries) | Good | Per page ($1.50-15/1000 pages) |
| Google Document AI | Invoices, contracts, procurement | Yes (custom processors) | Good | Per page (varies by processor) |
For applications requiring semantic search over extracted content, consider integrating with a vector database to enable retrieval across your document corpus.
How Virtido Can Help You Build Document Intelligence Systems
At Virtido, we help enterprises design and implement document processing pipelines that extract reliable, structured data from complex documents, with particular expertise in FinTech and financial services applications.
What We Offer
- Pipeline architecture design — Choosing the right combination of OCR, layout analysis, and LLM components for your document types
- Custom extraction development — Building extractors for invoices, statements, contracts, and domain-specific documents
- Validation and quality assurance — Implementing Pydantic schemas, confidence scoring, and human-in-the-loop review workflows
- Integration with existing systems — Connecting extraction pipelines to your ERP, accounting, or data warehouse systems
- AI/ML talent on demand — Data engineers and ML specialists to build and maintain your document processing infrastructure
We've delivered document intelligence solutions for clients across FinTech, insurance, legal services, and enterprise operations. Our staff augmentation model provides vetted engineers in 2-4 weeks with Swiss contracts and full IP protection.
Final Thoughts
Document intelligence has reached a practical inflection point. Vision-capable LLMs can now understand document layouts, extract semantic meaning, and handle format variations that would have required months of rule-based programming just a few years ago. The gap between what's technically possible and what's deployed in production is narrowing rapidly.
Success comes from combining the right tools: specialized extraction for tables and structured data, vision models for semantic understanding and flexibility, and rigorous validation to catch errors before they propagate. Pydantic schemas aren't just nice to have; they're essential for building systems you can trust with real business data.
Start with a focused use case and measurable accuracy targets. Build evaluation into your pipeline from day one. And remember that the goal isn't perfect extraction from any possible document; it's reliable extraction from your specific documents that integrates cleanly with your existing systems and workflows.
Frequently Asked Questions
What accuracy rates can I expect from LLM-based document extraction?
Well-implemented systems typically achieve 90-98% accuracy on structured fields like dates, amounts, and account numbers. Accuracy depends heavily on document quality, consistency of formats, and the complexity of what you're extracting. Free-form text fields and handwritten content have lower accuracy. Always measure accuracy on your specific document types before committing to production deployment.
Should I use vision models or text extraction for document processing?
It depends on your documents. Vision models excel at complex layouts, mixed content (text + images + tables), and documents where spatial relationships matter. Text extraction is cheaper and works well for text-heavy documents with simple layouts. Many production systems use both: specialized table extraction for structured data, vision models for semantic understanding and validation.
How do I handle scanned PDFs vs native digital PDFs?
Native PDFs contain embedded text that can be extracted directly, while scanned PDFs are essentially images requiring OCR. Check if a PDF has extractable text first using libraries like PyMuPDF. For scanned documents, either run OCR preprocessing (Tesseract, cloud OCR services) or send page images directly to vision-capable LLMs. Vision models can handle both, but direct text extraction is faster and cheaper when available.
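The "check for extractable text first" step reduces to a heuristic on the text extracted per page (e.g. from the `load_pdf` function shown earlier). The character threshold here is an illustrative assumption; tune it on your own corpus:

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 25) -> bool:
    """Heuristic: if extracted text is near-empty across pages, the PDF
    is probably a scan and needs OCR or a vision model. The threshold
    is an illustrative assumption."""
    if not page_texts:
        return True
    avg_chars = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg_chars < min_chars_per_page
```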
Can LLMs accurately extract handwritten text from documents?
Vision LLMs (GPT-4 Vision, Claude, Gemini) can read handwritten text with moderate accuracy, typically 80-90% for clear handwriting and significantly lower for poor handwriting. For critical handwritten fields, implement confidence scoring and human review for low-confidence extractions. Dedicated handwriting recognition models may outperform general vision LLMs for specific use cases.
What's the best approach for extracting tables from documents?
Use specialized table extraction tools (pdfplumber, Camelot, Tabula) rather than relying solely on LLMs for dense tabular data. These tools understand table structure at the layout level and handle merged cells, spanning headers, and complex grids better than general-purpose models. Use LLMs to interpret what the extracted tables mean and validate that extracted values are sensible.
How do I ensure GDPR compliance when processing documents containing personal data?
Key considerations include data residency (process in EU or with EU-based providers), data minimization (extract only what you need), retention limits (delete extracted data when no longer needed), and consent/legal basis documentation. Some cloud AI services offer GDPR-compliant processing options. For maximum control, consider on-premises processing using open-source tools. Always implement audit logging to track what data was extracted from which documents.
What does document extraction cost per page?
Costs vary significantly by approach. Cloud document AI services (Azure, AWS, Google) charge $0.001-0.015 per page. Vision LLM calls cost $0.01-0.10+ per page depending on model and image resolution. Open-source pipelines have infrastructure costs but no per-page API fees. A typical production system processing 10,000 pages/month might cost $100-500/month for cloud services, or significantly less with open-source tools and existing infrastructure.
How do I handle multi-page documents where information spans pages?
Multi-page handling requires careful design. Options include: processing pages independently and merging results (simplest but may miss cross-page context), concatenating text from all pages before extraction (works for text-heavy documents), or using models with long context windows that can process multiple page images. For tables that span pages, track headers and continue row extraction across page boundaries.
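The header-tracking idea can be sketched as a merge pass over per-page tables, using the dictionaries produced by the pdfplumber extractor shown earlier. It assumes a continuation table repeats the same header row on the next page:

```python
def merge_spanning_tables(tables: list[dict]) -> list[dict]:
    """Merge tables that continue across consecutive pages.
    Sketch: a table is treated as a continuation when it appears on the
    next page with an identical header row."""
    merged: list[dict] = []
    for table in tables:
        prev = merged[-1] if merged else None
        if (prev is not None
                and table["page"] == prev["page"] + 1
                and table["headers"] == prev["headers"]):
            prev["rows"].extend(table["rows"])
            prev["page"] = table["page"]  # track the last page seen
        else:
            merged.append(dict(table, rows=list(table["rows"])))
    return merged
```

Tables whose continuation pages omit the header row need a looser rule, such as matching on column count.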
Can I extract data from images embedded within documents?
Yes. Vision LLMs can interpret embedded images, charts, and diagrams. For charts and graphs, they can often extract the underlying data or describe trends. For images like signatures, stamps, or logos, they can detect presence and sometimes identify content. The accuracy depends on image quality and complexity. Consider extracting and processing embedded images separately for higher precision on critical visual elements.
How should I validate extracted data to catch errors?
Implement multiple validation layers: Pydantic schemas enforce data types and formats, business rules catch logical errors (e.g., transaction sum should match stated total), confidence thresholds flag uncertain extractions for review, and cross-document validation identifies outliers. For financial documents, always validate that calculated totals match extracted totals. Build a human review queue for low-confidence extractions rather than accepting potentially wrong data.
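These layers can be combined in a small routing step after schema validation. A sketch, assuming credits are stored as positive amounts and debits as negative, with an illustrative confidence threshold:

```python
from decimal import Decimal


def check_statement_totals(
    opening: Decimal,
    closing: Decimal,
    amounts: list[Decimal],
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Business-rule check: extracted transactions must reconcile the
    balances. Assumes credits positive, debits negative."""
    return abs(opening + sum(amounts) - closing) <= tolerance


def route_extraction(
    schema_valid: bool, totals_ok: bool, confidence: float,
    threshold: float = 0.9,
) -> str:
    """Auto-accept only when every layer passes; otherwise queue for
    human review. Threshold is an illustrative assumption."""
    if schema_valid and totals_ok and confidence >= threshold:
        return "accept"
    return "human-review"
```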