Document Intelligence with LLMs: Extracting Structure from Unstructured Data [2026]

Written by Virtido | Feb 24, 2026 10:00:00 AM

Enterprises are drowning in documents. Financial statements arrive as PDFs. Contracts exist in scanned images. Invoices, forms, and reports accumulate in formats that resist automation. According to IDC, unstructured data accounts for 80-90% of all enterprise data, yet most of it remains locked away from analytical systems and business processes.

Traditional document processing relied on rigid templates and rule-based OCR. These approaches break the moment a vendor changes their invoice format or a new contract type arrives. Large language models with vision capabilities have fundamentally changed what's possible: systems that understand document structure, extract meaning from context, and adapt to new formats without reprogramming.

TL;DR: Document intelligence combines vision-capable LLMs with traditional extraction techniques to convert unstructured documents into structured data. Modern approaches can handle complex layouts, tables, handwriting, and multi-page documents — with accuracy rates varying by document type and complexity (typically 85-95% for well-structured documents, lower for handwritten or degraded scans). The key is building validation into your pipeline from the start, using Pydantic schemas to enforce structure and catch extraction errors before they propagate downstream.

The Document Intelligence Challenge

Document processing isn't a new problem. What's changed is the sophistication of available tools and the expectations for accuracy and flexibility.

Why Traditional OCR Falls Short

Optical Character Recognition (OCR) converts images to text, but text alone doesn't capture document meaning. A bank statement is more than characters on a page: it has accounts, transactions, dates, and balances arranged in a specific logical structure.

Traditional approaches fail for several reasons:

  • Layout blindness — OCR produces a stream of text without understanding columns, tables, or sections
  • Template dependency — Rule-based extractors break when document formats change
  • Context ignorance — Systems can't infer meaning from surrounding content
  • No semantic understanding — "Total" near a number doesn't automatically link them

What LLMs Bring to Document Processing

Vision-capable LLMs (GPT-4 Vision, Claude, Gemini) can look at a document image and understand it much like a human would. They recognize:

  • Document structure — Headers, sections, paragraphs, lists, tables
  • Semantic relationships — Which values belong to which labels
  • Context from content — Inferring document type and purpose from the text itself
  • Flexibility across formats — Same model handles invoices, contracts, and forms

This doesn't mean traditional techniques are obsolete. The best pipelines combine specialized extraction tools with LLM understanding. For an overview of how LLMs work with external data sources, see our RAG guide.

Architecture Approaches

There's no single correct architecture for document intelligence. The right choice depends on your document types, accuracy requirements, and infrastructure constraints.

  • Vision-First — Send document images directly to a vision LLM. Best for complex layouts, mixed content, and rapid prototyping; limited by higher cost per page and token limits on large documents.
  • Text Extraction + LLM — Run OCR first, then have an LLM structure the text. Best for text-heavy documents and cost optimization; loses layout information and struggles with tables.
  • Hybrid Pipeline — Use specialized tools for tables and layout, with an LLM for semantics. Best for production systems with high accuracy requirements; more complex to build and maintain.

Vision-First Approach

The simplest architecture sends document images directly to a vision-capable LLM with instructions for what to extract. This works remarkably well for many use cases and requires minimal preprocessing.

Advantages:

  • Handles any document layout without configuration
  • Understands visual context (stamps, signatures, logos)
  • Fast to implement and iterate

Disadvantages:

  • Cost scales with page count
  • Large documents may exceed token limits
  • Less precise for dense tabular data

Text Extraction + LLM

Extract text first using OCR or PDF text extraction, then use an LLM to structure and interpret the content. Lower cost per document but loses spatial relationships.

Hybrid Pipeline

Production systems often combine approaches: specialized table extraction for financial data, layout analysis for document segmentation, and LLMs for semantic understanding and validation. This provides the best accuracy but requires more engineering investment.

Building an Extraction Pipeline

A production document intelligence pipeline typically includes these stages: ingestion, layout analysis, content extraction, LLM structuring, and validation. Let's walk through each with code examples.

Document Ingestion

The first step is loading documents and converting them to a processable format. For PDFs, this might mean extracting text, rendering pages as images, or both.

import fitz  # PyMuPDF
from pathlib import Path
from dataclasses import dataclass

@dataclass
class DocumentPage:
    page_number: int
    text: str
    image_path: Path | None = None

def load_pdf(pdf_path: str, render_images: bool = True) -> list[DocumentPage]:
    """Load PDF and extract text and optionally render page images."""
    doc = fitz.open(pdf_path)
    pages = []

    for page_num, page in enumerate(doc):
        # Extract text with layout preservation
        text = page.get_text("text")

        image_path = None
        if render_images:
            # Render page as image for vision models
            pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom for quality
            image_path = Path(f"/tmp/page_{page_num}.png")
            pix.save(str(image_path))

        pages.append(DocumentPage(
            page_number=page_num + 1,
            text=text,
            image_path=image_path
        ))

    doc.close()
    return pages

Layout Analysis

Understanding document structure before extraction improves accuracy. Layout analysis identifies headers, paragraphs, tables, and other structural elements.

For complex documents, consider using specialized layout models like LayoutLM or document AI services that return bounding boxes and element classifications. For simpler cases, heuristics based on text positioning often suffice.
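For the heuristic route, a minimal sketch: classify lines as headings or body text by relative font size, which PyMuPDF exposes per text span via page.get_text("dict"). The input shape and the 1.2x threshold are illustrative assumptions, not a fixed recipe:

```python
def classify_lines(
    lines: list[tuple[str, float]], heading_ratio: float = 1.2
) -> list[tuple[str, str]]:
    """Label each (text, font_size) pair as 'heading' or 'body'.

    A line counts as a heading when its font size exceeds the median size
    by heading_ratio -- crude, but often enough for simple documents.
    """
    if not lines:
        return []
    sizes = sorted(size for _, size in lines)
    median = sizes[len(sizes) // 2]
    return [
        (text, "heading" if size >= median * heading_ratio else "body")
        for text, size in lines
    ]
```

Feeding the labeled lines to the LLM as lightweight structure (for example, prefixing detected headings) often improves extraction on long text-heavy documents.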

Table Extraction

Tables are notoriously difficult for general-purpose extraction. Purpose-built tools significantly outperform both OCR and vision LLMs for dense tabular data.

import pdfplumber
from typing import Any

def extract_tables(pdf_path: str) -> list[dict[str, Any]]:
    """Extract tables from PDF using pdfplumber."""
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()

            for table_idx, table in enumerate(page_tables):
                if not table or len(table) < 2:
                    continue

                # First row as headers, rest as data
                headers = [str(h).strip() if h else f"col_{i}"
                          for i, h in enumerate(table[0])]
                rows = []

                for row in table[1:]:
                    row_dict = {}
                    for i, cell in enumerate(row):
                        header = headers[i] if i < len(headers) else f"col_{i}"
                        row_dict[header] = str(cell).strip() if cell else ""
                    rows.append(row_dict)

                tables.append({
                    "page": page_num + 1,
                    "table_index": table_idx,
                    "headers": headers,
                    "rows": rows
                })

    return tables

LLM-Powered Structuring

The LLM's role is to interpret extracted content and produce structured output. Using structured output modes (JSON mode or function calling) ensures consistent, parseable results.

import anthropic
import base64
from pathlib import Path

def extract_with_vision(
    image_path: Path,
    extraction_prompt: str,
    schema_description: str
) -> dict:
    """Extract structured data from document image using Claude."""
    client = anthropic.Anthropic()

    # Load and encode image
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    # Determine media type
    suffix = image_path.suffix.lower()
    media_type = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg"
    }.get(suffix, "image/png")

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data
                        }
                    },
                    {
                        "type": "text",
                        "text": f"""{extraction_prompt}

Return the extracted data as JSON matching this schema:
{schema_description}

Respond with only valid JSON, no additional text."""
                    }
                ]
            }
        ]
    )

    import json

    # Be tolerant of stray wrapping (e.g. markdown fences) around the JSON
    text = response.content[0].text
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"No JSON object found in model response: {text[:200]}")
    return json.loads(text[start:end + 1])

Validation with Pydantic

Validation is critical. LLMs can hallucinate fields, misformat dates, or return unexpected structures. Pydantic schemas enforce data contracts and provide clear error messages when extraction fails.

from pydantic import BaseModel, Field, field_validator
from datetime import date
from decimal import Decimal
from typing import Optional

class Transaction(BaseModel):
    date: date
    description: str
    amount: Decimal
    transaction_type: str = Field(pattern="^(credit|debit)$")

    @field_validator("amount", mode="before")
    @classmethod
    def parse_amount(cls, v):
        """Handle various amount formats."""
        if isinstance(v, str):
            # Remove currency symbols and commas
            cleaned = v.replace("$", "").replace(",", "").replace(" ", "")
            return Decimal(cleaned)
        return v

class BankStatement(BaseModel):
    account_number: str
    statement_period_start: date
    statement_period_end: date
    opening_balance: Decimal
    closing_balance: Decimal
    transactions: list[Transaction]

    @field_validator("account_number")
    @classmethod
    def validate_account(cls, v):
        """Ensure account number format."""
        cleaned = v.replace(" ", "").replace("-", "")
        if not cleaned.isdigit() or len(cleaned) < 8:
            raise ValueError("Invalid account number format")
        return cleaned

def validate_extraction(raw_data: dict, schema: type[BaseModel]) -> BaseModel:
    """Validate extracted data against Pydantic schema."""
    try:
        return schema.model_validate(raw_data)
    except Exception as e:
        # Log validation error for review
        print(f"Validation failed: {e}")
        raise

Complete Pipeline Example

Here's how these components fit together in a complete extraction pipeline:

from pathlib import Path
from dataclasses import dataclass
from typing import TypeVar, Generic
from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)

@dataclass
class ExtractionResult(Generic[T]):
    success: bool
    data: T | None
    errors: list[str]
    confidence: float
    raw_response: dict

class DocumentExtractor:
    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        self.model = model

    def extract(
        self,
        pdf_path: str,
        schema: type[T],
        extraction_prompt: str
    ) -> ExtractionResult[T]:
        """Full extraction pipeline with validation."""
        errors = []

        # 1. Load document
        pages = load_pdf(pdf_path, render_images=True)

        # 2. Extract tables (for documents with tabular data)
        tables = extract_tables(pdf_path)

        # 3. Build text context (unused in this vision-first path, but handy
        # for text-based or multi-page extraction strategies)
        context = self._build_context(pages, tables)

        # 4. Extract with vision (using first page image); fail fast when no
        # page image was rendered so raw_data is never referenced unbound
        if not pages or pages[0].image_path is None:
            errors.append("No rendered page image available for extraction")
            return ExtractionResult(
                success=False,
                data=None,
                errors=errors,
                confidence=0.0,
                raw_response={}
            )

        try:
            raw_data = extract_with_vision(
                pages[0].image_path,
                extraction_prompt,
                str(schema.model_json_schema())
            )
        except Exception as e:
            errors.append(f"Extraction failed: {e}")
            return ExtractionResult(
                success=False,
                data=None,
                errors=errors,
                confidence=0.0,
                raw_response={}
            )

        # 5. Validate against schema
        try:
            validated = schema.model_validate(raw_data)
            return ExtractionResult(
                success=True,
                data=validated,
                errors=[],
                confidence=0.95,  # placeholder; wire up real confidence scoring
                raw_response=raw_data
            )
        except Exception as e:
            errors.append(f"Validation failed: {e}")
            return ExtractionResult(
                success=False,
                data=None,
                errors=errors,
                confidence=0.0,
                raw_response=raw_data
            )

    def _build_context(self, pages, tables) -> str:
        """Combine page text and table data into context."""
        context_parts = []

        for page in pages:
            context_parts.append(f"--- Page {page.page_number} ---\n{page.text}")

        if tables:
            context_parts.append("\n--- Extracted Tables ---")
            for table in tables:
                context_parts.append(f"Table on page {table['page']}:")
                context_parts.append(str(table['rows'][:5]))  # First 5 rows

        return "\n\n".join(context_parts)

Handling Complex Document Types

Different document categories require different extraction strategies. Here's guidance for common enterprise document types.

Financial Statements

Bank statements, invoices, and financial reports share common challenges: dense tables, precise numeric requirements, and strict validation needs.

  • Use specialized table extraction — pdfplumber or Camelot for transaction tables
  • Validate totals — Sum extracted transactions and compare to stated totals
  • Handle multi-page tables — Track headers across page breaks
  • Currency parsing — Normalize formats (1,234.56 vs 1.234,56)
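The totals check above can be sketched as a standalone reconciliation function. The sign convention (credits add, debits subtract) is an assumption; some statements encode debits as negative amounts instead:

```python
from decimal import Decimal

def totals_reconcile(
    opening: Decimal,
    closing: Decimal,
    transactions: list[tuple[str, Decimal]],
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Check that opening balance plus signed transactions matches closing.

    Each transaction is a (type, amount) pair with type 'credit' or 'debit';
    credits add, debits subtract (assumed convention -- verify per source).
    """
    net = sum(
        (amount if kind == "credit" else -amount for kind, amount in transactions),
        Decimal("0"),
    )
    return abs(opening + net - closing) <= tolerance
```

Run it after schema validation and route failures to a review queue rather than silently accepting the document.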

Contracts and Legal Documents

Legal documents prioritize text extraction accuracy and section identification over tabular data.

  • Section detection — Identify clauses, parties, dates, and key terms
  • Cross-reference handling — "As defined in Section 3.2"
  • Signature detection — Identify signed vs unsigned documents
  • Amendment tracking — Link amendments to original agreements

Forms and Applications

Structured forms with fields and checkboxes benefit from layout understanding.

  • Field-value pairing — Match labels to their filled values
  • Checkbox detection — Identify checked vs unchecked options
  • Handwriting recognition — Vision models handle handwritten entries
  • Required field validation — Ensure mandatory fields are populated

Production Considerations

Moving document intelligence from prototype to production requires attention to accuracy measurement, cost management, and data security.

Accuracy Measurement

You can't improve what you don't measure. Build evaluation into your pipeline from day one.

  • Ground truth datasets — Manually label a representative sample of documents
  • Field-level accuracy — Track accuracy per field, not just per document
  • Confidence scoring — Flag low-confidence extractions for human review
  • Error categorization — Distinguish OCR errors, schema mismatches, and hallucinations

Target accuracy depends on downstream use. Automated processing might require 99%+ accuracy, while human-in-the-loop workflows can tolerate lower thresholds with review queues.
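A field-level accuracy report over a labeled sample might look like the sketch below. Exact string match after stripping whitespace is a simplifying assumption; dates and amounts usually need normalization first:

```python
def field_accuracy(
    predictions: list[dict], ground_truth: list[dict]
) -> dict[str, float]:
    """Compute per-field accuracy across paired extraction/label dicts.

    Assumes predictions[i] and ground_truth[i] describe the same document;
    a field is correct only on exact match after whitespace stripping.
    """
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            got = pred.get(field)
            if got is not None and str(got).strip() == str(expected).strip():
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / n for field, n in total.items()}
```

Tracking this per field, per document type, and per model version makes regressions visible as soon as you change a prompt or upgrade a model.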

Cost Optimization

Document processing costs can escalate quickly at scale. Optimize through:

  • Tiered processing — Use cheaper models for simple documents, expensive models for complex ones
  • Caching — Cache extractions for documents you've seen before
  • Batching — Process multiple pages in single API calls where possible
  • Resolution tuning — Lower image resolution for text-heavy documents

Typical costs range from $0.01-0.10 per page depending on complexity and model choice.
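Tiered processing can be as simple as a routing function over cheap complexity signals. The tier names and thresholds below are placeholders to tune against your own cost and accuracy measurements:

```python
def choose_model(page_count: int, has_tables: bool, has_handwriting: bool) -> str:
    """Route a document to a model tier by rough complexity signals.

    Tier names and thresholds are illustrative assumptions, not real models.
    """
    if has_handwriting or has_tables:
        return "large-vision-model"   # hardest cases: pay for accuracy
    if page_count <= 2:
        return "small-vision-model"   # short, simple docs: cheapest tier
    return "mid-vision-model"         # everything else
```

The signals themselves can come from cheap preprocessing: page count from the PDF metadata, table presence from pdfplumber, handwriting flags from a quick low-resolution classification pass.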

Security and Compliance

Documents often contain sensitive information. Address security early:

  • Data residency — Ensure processing happens in appropriate jurisdictions
  • PII detection — Identify and handle personal information appropriately
  • Audit logging — Track what was extracted from which documents
  • Retention policies — Define how long extracted data is stored

For GDPR and similar regulations, consider on-premises processing or providers with appropriate certifications.
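As one building block, PII can be masked before text leaves your infrastructure, for example before an external LLM call. The regex patterns below are illustrative only; production-grade PII detection needs far broader coverage (names, addresses, locale-specific ID formats) or a dedicated service:

```python
import re

# Illustrative patterns only -- not exhaustive PII coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "phone": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Keeping the mapping from placeholders back to originals in a local store lets you re-identify values after the LLM response returns, so the external provider never sees the raw data.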

Tools and Services Landscape

The document intelligence market includes cloud services, open-source tools, and specialized platforms.

  • Cloud AI Services (Azure Document Intelligence, AWS Textract, Google Document AI) — Pre-built models, managed infrastructure, high accuracy. Best for standard document types and quick deployment.
  • Vision LLMs (GPT-4 Vision, Claude, Gemini) — Flexible, handle any layout, semantic understanding. Best for complex or varied documents and custom extraction.
  • Open Source (PyMuPDF, pdfplumber, Camelot, Tesseract, LayoutLM) — No API costs, full control, on-premises deployment. Best for cost-sensitive projects and data sovereignty requirements.
  • IDP Platforms (UiPath, ABBYY, Kofax, Hyperscience) — End-to-end workflows, enterprise features. Best for high-volume enterprise processing.

Cloud Service Comparison

  • Azure Document Intelligence — Pre-built models for invoices, receipts, IDs, and contracts; supports custom model training; excellent table extraction; priced per page ($1.50-15 per 1,000 pages).
  • AWS Textract — Pre-built support for forms, tables, invoices, and IDs; limited customization (queries); good table extraction; priced per page ($1.50-15 per 1,000 pages).
  • Google Document AI — Pre-built processors for invoices, contracts, and procurement; supports custom processors; good table extraction; priced per page (varies by processor).

For applications requiring semantic search over extracted content, consider integrating with a vector database to enable retrieval across your document corpus.

How Virtido Can Help You Build Document Intelligence Systems

At Virtido, we help enterprises design and implement document processing pipelines that extract reliable, structured data from complex documents, with particular expertise in FinTech and financial services applications.

What We Offer

  • Pipeline architecture design — Choosing the right combination of OCR, layout analysis, and LLM components for your document types
  • Custom extraction development — Building extractors for invoices, statements, contracts, and domain-specific documents
  • Validation and quality assurance — Implementing Pydantic schemas, confidence scoring, and human-in-the-loop review workflows
  • Integration with existing systems — Connecting extraction pipelines to your ERP, accounting, or data warehouse systems
  • AI/ML talent on demand — Data engineers and ML specialists to build and maintain your document processing infrastructure

We've delivered document intelligence solutions for clients across FinTech, insurance, legal services, and enterprise operations. Our staff augmentation model provides vetted engineers in 2-4 weeks with Swiss contracts and full IP protection.

Contact us to discuss your document processing needs

Final Thoughts

Document intelligence has reached a practical inflection point. Vision-capable LLMs can now understand document layouts, extract semantic meaning, and handle format variations that would have required months of rule-based programming just a few years ago. The gap between what's technically possible and what's deployed in production is narrowing rapidly.

Success comes from combining the right tools: specialized extraction for tables and structured data, vision models for semantic understanding and flexibility, and rigorous validation to catch errors before they propagate. Pydantic schemas aren't just nice to have; they're essential for building systems you can trust with real business data.

Start with a focused use case and measurable accuracy targets. Build evaluation into your pipeline from day one. And remember that the goal isn't perfect extraction from any possible document; it's reliable extraction from your specific documents that integrates cleanly with your existing systems and workflows.

Frequently Asked Questions

What accuracy rates can I expect from LLM-based document extraction?

Well-implemented systems typically achieve 90-98% accuracy on structured fields like dates, amounts, and account numbers. Accuracy depends heavily on document quality, consistency of formats, and the complexity of what you're extracting. Free-form text fields and handwritten content have lower accuracy. Always measure accuracy on your specific document types before committing to production deployment.

Should I use vision models or text extraction for document processing?

It depends on your documents. Vision models excel at complex layouts, mixed content (text + images + tables), and documents where spatial relationships matter. Text extraction is cheaper and works well for text-heavy documents with simple layouts. Many production systems use both: specialized table extraction for structured data, vision models for semantic understanding and validation.

How do I handle scanned PDFs vs native digital PDFs?

Native PDFs contain embedded text that can be extracted directly, while scanned PDFs are essentially images requiring OCR. Check if a PDF has extractable text first using libraries like PyMuPDF. For scanned documents, either run OCR preprocessing (Tesseract, cloud OCR services) or send page images directly to vision-capable LLMs. Vision models can handle both, but direct text extraction is faster and cheaper when available.
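That check reduces to a small routing helper. The per-page character counts would come from something like len(page.get_text("text")) with PyMuPDF, and the 25-character threshold is an assumed heuristic — scanned pages typically yield little or no embedded text:

```python
def route_pdf(page_char_counts: list[int], min_chars_per_page: int = 25) -> str:
    """Decide a processing route from per-page extracted character counts.

    Returns 'native_text' when the PDF appears to have an embedded text
    layer, 'ocr' otherwise; the threshold is an assumed heuristic.
    """
    if not page_char_counts:
        return "ocr"
    avg = sum(page_char_counts) / len(page_char_counts)
    return "native_text" if avg >= min_chars_per_page else "ocr"
```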

Can LLMs accurately extract handwritten text from documents?

Vision LLMs (GPT-4 Vision, Claude, Gemini) can read handwritten text with moderate accuracy, typically 80-90% for clear handwriting and significantly lower for poor handwriting. For critical handwritten fields, implement confidence scoring and human review for low-confidence extractions. Dedicated handwriting recognition models may outperform general vision LLMs for specific use cases.

What's the best approach for extracting tables from documents?

Use specialized table extraction tools (pdfplumber, Camelot, Tabula) rather than relying solely on LLMs for dense tabular data. These tools understand table structure at the layout level and handle merged cells, spanning headers, and complex grids better than general-purpose models. Use LLMs to interpret what the extracted tables mean and validate that extracted values are sensible.

How do I ensure GDPR compliance when processing documents containing personal data?

Key considerations include data residency (process in EU or with EU-based providers), data minimization (extract only what you need), retention limits (delete extracted data when no longer needed), and consent/legal basis documentation. Some cloud AI services offer GDPR-compliant processing options. For maximum control, consider on-premises processing using open-source tools. Always implement audit logging to track what data was extracted from which documents.

What does document extraction cost per page?

Costs vary significantly by approach. Cloud document AI services (Azure, AWS, Google) charge $0.001-0.015 per page. Vision LLM calls cost $0.01-0.10+ per page depending on model and image resolution. Open-source pipelines have infrastructure costs but no per-page API fees. A typical production system processing 10,000 pages/month might cost $100-500/month for cloud services, or significantly less with open-source tools and existing infrastructure.

How do I handle multi-page documents where information spans pages?

Multi-page handling requires careful design. Options include: processing pages independently and merging results (simplest but may miss cross-page context), concatenating text from all pages before extraction (works for text-heavy documents), or using models with long context windows that can process multiple page images. For tables that span pages, track headers and continue row extraction across page boundaries.
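The header-tracking idea can be sketched against the table shape produced by extract_tables earlier. Treating identical headers on the next page as a continuation is a heuristic assumption; some layouts repeat headers on every page, others omit them entirely:

```python
from typing import Any

def merge_spanning_tables(tables: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge tables that continue across consecutive pages.

    Continuation heuristic (an assumption): the next page's table repeats
    the same headers. Rows are appended and the 'page' field tracks the
    last page seen so three-page spans chain correctly.
    """
    merged: list[dict[str, Any]] = []
    for table in tables:
        prev = merged[-1] if merged else None
        if (prev is not None
                and table["headers"] == prev["headers"]
                and table["page"] == prev["page"] + 1):
            prev["rows"].extend(table["rows"])
            prev["page"] = table["page"]
        else:
            merged.append({**table, "rows": list(table["rows"])})
    return merged
```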

Can I extract data from images embedded within documents?

Yes. Vision LLMs can interpret embedded images, charts, and diagrams. For charts and graphs, they can often extract the underlying data or describe trends. For images like signatures, stamps, or logos, they can detect presence and sometimes identify content. The accuracy depends on image quality and complexity. Consider extracting and processing embedded images separately for higher precision on critical visual elements.

How should I validate extracted data to catch errors?

Implement multiple validation layers: Pydantic schemas enforce data types and formats, business rules catch logical errors (e.g., transaction sum should match stated total), confidence thresholds flag uncertain extractions for review, and cross-document validation identifies outliers. For financial documents, always validate that calculated totals match extracted totals. Build a human review queue for low-confidence extractions rather than accepting potentially wrong data.