Document AI for Agencies: Extracting Structure from PDFs, Forms, and Contracts
Clients ask agencies to 'do something with these PDFs' more often than you'd think. Here's how to actually build document extraction pipelines that work in production: OCR, vision models, and structured output.
Anurag Verma
The request comes in a few different shapes: “We have 10,000 invoices in a shared drive and need the totals in a spreadsheet.” “Clients upload signed contracts and we manually re-enter the key dates into our CRM.” “We get insurance forms as PDFs and our team types the data into our system.”
Every agency that works with business clients runs into document automation requests. They’re some of the most tractable AI projects: the inputs are defined, the outputs are defined, and the value is measurable in hours saved per week. They’re also easy to get wrong if you reach for the wrong tool.
Here’s what actually works in 2026.
The Three Document Types and Why They Matter
Not all PDFs are alike. Your extraction approach changes completely depending on what you’re working with.
Text PDFs contain machine-readable text embedded in the file. If you open the PDF in Acrobat and can select and copy text normally, you have a text PDF. These are straightforward: extract the text with a library like pypdf or pdfplumber and work with the string output. No OCR needed.
Scanned PDFs are images of documents. The PDF is a container for one or more JPEG or PNG images of scanned pages. There’s no extractable text. You need OCR.
Form PDFs (also called AcroForms) contain interactive form fields with their own metadata. pypdf can extract the field names and values directly without OCR or text parsing.
import pypdf

reader = pypdf.PdfReader("form.pdf")

# Check if it has interactive form fields
fields = reader.get_fields()
if fields:
    for field_name, field in fields.items():
        print(f"{field_name}: {field.get('/V', 'empty')}")
else:
    # No form fields: try text extraction
    for page in reader.pages:
        text = page.extract_text()
        print(text)
Identify which type you have before choosing a tool. Applying OCR to a text PDF wastes compute and often produces worse output than simple text extraction.
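A rough way to automate that check is sketched below. The thresholds (20 characters per page, half the pages) and both function names are assumptions for illustration, not tested values. A scanned PDF that already carries an OCR text layer will be classed as text, which is usually what you want.

```python
from typing import Sequence


def classify_pdf(has_form_fields: bool, chars_per_page: Sequence[int],
                 min_chars: int = 20) -> str:
    """Return 'form', 'text', or 'scanned' from simple per-document stats."""
    if has_form_fields:
        return "form"
    # A page with almost no extractable characters is probably an image.
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    if chars_per_page and text_pages / len(chars_per_page) >= 0.5:
        return "text"
    return "scanned"


def classify_pdf_file(path: str) -> str:
    """Gather the stats with pypdf, then apply the rule above."""
    import pypdf  # imported here so classify_pdf works without pypdf installed

    reader = pypdf.PdfReader(path)
    has_fields = bool(reader.get_fields())
    counts = [len((page.extract_text() or "").strip()) for page in reader.pages]
    return classify_pdf(has_fields, counts)
```

Keeping the decision rule separate from the file reading makes it easy to tune the thresholds against a sample of the client’s real documents.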
Text Extraction for Machine-Readable PDFs
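pdfplumber is one of the most capable libraries for text PDFs. It preserves spatial information (the position of every character and word), can extract tables, and lets you crop to a region of the page, which is how you deal with multi-column layouts.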
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    for page in pdf.pages:
        # Full text extraction
        text = page.extract_text()

        # Table extraction: each table is a list of rows, each row a list of cells
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

        # Extract text from a specific bounding box (x0, top, x1, bottom), in PDF points
        # Useful for known-layout documents
        cropped = page.within_bbox((0, 0, 300, 200))
        header_text = cropped.extract_text()
For documents with known, consistent layouts (the same invoice template repeated 10,000 times), bounding box extraction is reliable. You identify where the invoice number, date, and total always appear, crop to those regions, and extract.
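As a minimal sketch of that approach: the region coordinates and field names below are hypothetical and would come from measuring one sample of the client’s template. The function expects a pdfplumber page and uses the same `within_bbox` call as above. Each box has to lie inside the page bounds.

```python
# Hypothetical regions for one invoice template, as (x0, top, x1, bottom)
# in PDF points measured from the top-left corner of the page.
INVOICE_REGIONS = {
    "invoice_number": (400, 40, 580, 60),
    "invoice_date": (400, 60, 580, 80),
    "total": (400, 700, 580, 730),
}


def extract_regions(page, regions=INVOICE_REGIONS) -> dict:
    """Crop each named region and return its stripped text (None if empty)."""
    result = {}
    for name, bbox in regions.items():
        text = page.within_bbox(bbox).extract_text()
        result[name] = text.strip() if text and text.strip() else None
    return result


# Usage, assuming invoice.pdf follows the template:
#   with pdfplumber.open("invoice.pdf") as pdf:
#       fields = extract_regions(pdf.pages[0])
```

An empty region comes back as None, which is a useful signal that a document does not follow the template after all.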
For documents with variable layouts, you’ll need LLM-based parsing after extraction (more on that below).
OCR for Scanned Documents
When you have scanned PDFs, the standard open-source path is Tesseract via the pytesseract wrapper, with pdf2image to convert PDF pages to images first.
pip install pytesseract pdf2image pillow
# Also install the Tesseract binary and poppler (pdf2image uses it to render pages):
# apt-get install tesseract-ocr poppler-utils
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str) -> str:
    # Render each PDF page to an image at 300 DPI
    pages = convert_from_path(path, dpi=300)
    full_text = []
    for page_image in pages:
        text = pytesseract.image_to_string(page_image, lang='eng')
        full_text.append(text)
    # Separate pages with a blank line
    return '\n\n'.join(full_text)
DPI matters for Tesseract accuracy. 300 DPI is the minimum for reliable results. 150 DPI on a rotated or slightly blurry scan will produce garbage. If your clients are scanning physical documents, the document quality ceiling is wherever their scanner is set.
Tesseract accuracy on clean, well-formatted text is reasonable. It degrades quickly on:
- Handwriting (Tesseract is trained on printed text)
- Non-standard fonts
- Tables with thin borders
- Documents with background graphics or watermarks
For these cases, commercial OCR APIs (Google Document AI, AWS Textract, Azure Document Intelligence) outperform Tesseract substantially and handle tables and form structures natively.
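import boto3

textract = boto3.client('textract', region_name='us-east-1')

# The synchronous API accepts single-page documents; multi-page PDFs
# need the asynchronous start_document_analysis call instead.
with open("scanned-invoice.pdf", "rb") as f:
    document = f.read()

response = textract.analyze_document(
    Document={'Bytes': document},
    FeatureTypes=['TABLES', 'FORMS']
)

# Index blocks by Id so relationships can be followed
blocks = {block['Id']: block for block in response['Blocks']}

def block_text(block: dict) -> str:
    """Join the WORD children of a KEY or VALUE block."""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = blocks[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)

# Extract key-value pairs (form fields)
for block in response['Blocks']:
    if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
        # Find the corresponding value block(s)
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                for value_id in rel['Ids']:
                    print(f"{block_text(block)}: {block_text(blocks[value_id])}")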
AWS Textract charges per page, and forms and tables analysis costs more than plain text detection: roughly $0.015-0.05 per page depending on the features you request. For a client processing 10,000 single-page invoices monthly, that’s about $150-500/month in API costs, which is usually well under the cost of manual data entry.
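One way to decide which scans justify a paid API is to run Tesseract first and look at its own word-level confidence. The sketch below assumes you already have the dict returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`. The function names and the threshold of 70 are illustrative assumptions, so calibrate against real documents.

```python
def mean_word_confidence(ocr_data: dict) -> float:
    """Average Tesseract confidence (0-100) over recognized words."""
    scores = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        # A confidence of -1 marks layout boxes that contain no word.
        if conf >= 0 and str(text).strip():
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


def needs_commercial_ocr(ocr_data: dict, threshold: float = 70.0) -> bool:
    """True when the Tesseract pass looks too weak to trust."""
    return mean_word_confidence(ocr_data) < threshold
```

A page with no recognized words scores 0.0 and is escalated, which covers blank-looking scans and handwriting that Tesseract skips entirely.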
Using Vision Models for Unstructured Documents
The real capability shift in 2025-2026 is using multimodal LLMs (models that accept images as input) for document extraction. You render the PDF page as an image and ask the model to extract specific fields.
This works where everything else fails: mixed handwritten and printed text, complex table layouts, documents where the semantic meaning matters rather than just the characters.
import anthropic
import base64
import io
import json
from pdf2image import convert_from_path

client = anthropic.Anthropic()

def extract_invoice_data(pdf_path: str) -> dict:
    pages = convert_from_path(pdf_path, dpi=150)  # Lower DPI ok for vision models

    # Convert first page to base64
    buffer = io.BytesIO()
    pages[0].save(buffer, format='JPEG', quality=85)
    image_data = base64.standard_b64encode(buffer.getvalue()).decode('utf-8')

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": """Extract the following fields from this invoice image and return them as JSON:
{
  "invoice_number": "string or null",
  "invoice_date": "ISO date string or null",
  "due_date": "ISO date string or null",
  "vendor_name": "string or null",
  "vendor_address": "string or null",
  "total_amount": "number or null",
  "currency": "3-letter code or null",
  "line_items": [{"description": "string", "quantity": "number", "unit_price": "number", "total": "number"}]
}
Return only the JSON object, no explanation."""
                }
            ],
        }]
    )

    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Model returned text with explanation: extract the JSON object
        text = response.content[0].text
        start = text.find('{')
        end = text.rfind('}') + 1
        return json.loads(text[start:end])
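Per page, vision model extraction is not necessarily the expensive option: roughly $0.003-0.01 per page depending on the model and document size, versus $0.015-0.05 for Textract with features. The bigger difference is capability. Vision models handle documents that Textract can’t: handwritten forms, documents in languages with poor Textract support, mixed-format documents. The trade-off is that a language model can misread or invent a value without signalling it, so its output needs validation.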
For high-volume pipelines, use Textract or Google Document AI for standard forms and fall back to vision models only for documents that fail structured extraction.
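The fallback logic can be sketched as a small router. Everything here is illustrative: `primary` and `fallback` stand for whatever extractor functions you write (for example a Textract wrapper and the vision function above), and the required field names are assumptions.

```python
from typing import Callable, Optional, Tuple

Extractor = Callable[[str], Optional[dict]]


def is_complete(result: Optional[dict], required: Tuple[str, ...]) -> bool:
    """A result counts as usable when every required field is non-null."""
    return result is not None and all(result.get(k) is not None for k in required)


def extract_with_fallback(
    path: str,
    primary: Extractor,
    fallback: Extractor,
    required: Tuple[str, ...] = ("invoice_number", "total_amount"),
) -> Tuple[Optional[dict], str]:
    """Try the structured extractor first; call the fallback only if needed.

    Returns (result, source) so you can track how often the fallback runs.
    """
    try:
        result = primary(path)
    except Exception:
        result = None  # treat an extractor error like an incomplete result
    if is_complete(result, required):
        return result, "primary"
    return fallback(path), "fallback"
```

Logging the `source` value per document tells you what share of the archive actually needs the vision model, which feeds directly into cost estimates.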
Structured Output to Prevent Parsing Nightmares
The extraction step gives you text or an image. The parsing step turns that raw content into structured data. Prompt engineering with JSON output is the reliable path: tell the model exactly what schema you need and validate the output.
For production pipelines, use Pydantic models for validation:
from datetime import date
from typing import Optional
import json

from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class Invoice(BaseModel):
    # Defaults of None let a document omit a field without failing validation
    invoice_number: Optional[str] = None
    invoice_date: Optional[date] = None
    due_date: Optional[date] = None
    vendor_name: Optional[str] = None
    vendor_address: Optional[str] = None
    total_amount: Optional[float] = None
    currency: Optional[str] = "USD"
    line_items: list[LineItem] = []

def parse_invoice_response(raw_json: str) -> Invoice:
    try:
        data = json.loads(raw_json)
        return Invoice(**data)
    except (json.JSONDecodeError, TypeError, ValidationError) as e:
        # Log the failure, flag for human review
        raise ValueError(f"Extraction failed validation: {e}") from e
When the LLM isn’t sure about a field, it often returns null. Build your pipeline to handle missing fields gracefully rather than treating every null as a failure. Some documents genuinely don’t have all fields.
Building a Pipeline That Scales
For one-off extractions, a script is enough. For ongoing client work, you need a pipeline with queuing, retry logic, and human review for failed extractions.
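Retry logic is the simplest of those pieces. A minimal sketch with exponential backoff and jitter follows. The function name, attempt count, and delays are assumptions, and in practice you would catch only the transient errors your OCR or LLM client raises rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller mark the job failed
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries when many jobs fail at once.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Wrapping a call looks like `with_retries(lambda: extract_invoice_data(path))`. The `sleep` parameter exists so tests can run without waiting.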
A practical architecture:
Client uploads PDF
        ↓
Classify document type (text/scanned/form)
        ↓
Extract text or OCR
        ↓
LLM parsing with schema
        ↓
Pydantic validation
      /     \
   pass     fail
    ↓         ↓
Write to    Human review queue
database
For the queue, a simple database table works at small scale:
CREATE TABLE document_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status TEXT NOT NULL DEFAULT 'pending',
    -- pending → processing → complete | failed | needs_review
    input_path TEXT NOT NULL,
    result JSONB,
    error_message TEXT,
    confidence_score FLOAT,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
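To show how a worker might move jobs through those statuses, here is a sketch that uses Python’s built-in sqlite3 as a stand-in for the Postgres table, so it runs without a database server. The schema is a simplified SQLite translation and the function names are hypothetical. With several Postgres workers you would claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` rather than this select-then-update.

```python
import sqlite3
from typing import Optional

# SQLite stand-in for the table above (no UUID or JSONB types).
SQLITE_SCHEMA = """
CREATE TABLE IF NOT EXISTS document_jobs (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pending',
    input_path TEXT NOT NULL,
    result TEXT,
    error_message TEXT,
    confidence_score REAL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def claim_next_job(conn: sqlite3.Connection) -> Optional[tuple]:
    """Move the oldest pending job to 'processing'; return (id, input_path)."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, input_path FROM document_jobs "
            "WHERE status = 'pending' ORDER BY created_at, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE document_jobs SET status = 'processing', "
            "updated_at = CURRENT_TIMESTAMP WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        # rowcount is 0 if something else claimed the job in between.
        return row if cur.rowcount == 1 else None


def finish_job(conn: sqlite3.Connection, job_id: str, status: str,
               result_json: Optional[str] = None, error: Optional[str] = None) -> None:
    """Record the outcome of a processed job."""
    if status not in {"complete", "failed", "needs_review"}:
        raise ValueError(f"unexpected final status: {status}")
    with conn:
        conn.execute(
            "UPDATE document_jobs SET status = ?, result = ?, error_message = ?, "
            "updated_at = CURRENT_TIMESTAMP WHERE id = ?",
            (status, result_json, error, job_id),
        )
```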
Set a confidence threshold below which extractions get flagged for human review. You can calculate confidence by asking the model to return a confidence score alongside the data, or by checking for null fields you’d expect to be present.
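The null-check variant is easy to sketch. The expected field list and the 0.75 threshold below are assumptions to tune per client, and the statuses match the ones in the jobs table.

```python
from typing import Iterable

# Hypothetical list of fields every invoice is expected to carry.
EXPECTED_FIELDS = ("invoice_number", "invoice_date", "vendor_name", "total_amount")


def null_based_confidence(extracted: dict,
                          expected: Iterable[str] = EXPECTED_FIELDS) -> float:
    """Fraction of expected fields that came back non-null (0.0 to 1.0)."""
    expected = list(expected)
    if not expected:
        return 1.0
    present = sum(1 for name in expected if extracted.get(name) is not None)
    return present / len(expected)


def route(extracted: dict, threshold: float = 0.75) -> str:
    """Map the confidence score to a job status."""
    if null_based_confidence(extracted) >= threshold:
        return "complete"
    return "needs_review"
```

This only measures completeness. A field that is present but wrong still scores as confident, so it complements the review queue rather than replacing it.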
A human review interface doesn’t have to be fancy: a table showing the PDF alongside the extracted fields with an “approve” or “correct and approve” action. The goal is catching failures before they corrupt the client’s database.
What to Charge and How to Frame It
Document automation projects have clear ROI, which makes them easier to sell than many AI projects. The frame: if a staff member processes 100 documents per day at 3 minutes each, that’s 5 hours of manual data entry. At a loaded cost of $30/hour, that’s $150/day, or $3,000/month over 20 working days. A $15,000 automation project pays for itself in 5 months.
Frame the pricing against the cost of the manual process, not the cost of your engineering time. Clients understand “this replaces $36,000 of annual labor” better than “this costs 120 hours at our rate.”
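The arithmetic is worth keeping in a small calculator you can rerun with each client’s numbers. The function names are illustrative, and the default of 20 working days per month is an assumption implied by the $150/day and $3,000/month figures.

```python
def monthly_labor_cost(docs_per_day: int, minutes_per_doc: float,
                       hourly_cost: float, working_days: int = 20) -> float:
    """Monthly cost of processing the documents by hand."""
    hours_per_day = docs_per_day * minutes_per_doc / 60
    return hours_per_day * hourly_cost * working_days


def payback_months(project_price: float, monthly_savings: float) -> float:
    """Months until the project price is recovered from labor savings."""
    return project_price / monthly_savings


monthly = monthly_labor_cost(docs_per_day=100, minutes_per_doc=3, hourly_cost=30)
print(monthly)                          # 3000.0
print(monthly * 12)                     # 36000.0
print(payback_months(15_000, monthly))  # 5.0
```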
Scope carefully. The variance in document quality across a client’s archive is almost always higher than expected. Start with a small sample (50-100 documents across the variety they’ll have) before committing to an accuracy guarantee. Documents from 2015 scan worse than documents from 2023. Multi-page contracts with handwritten annotations are harder than single-page typed forms.
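Once someone has hand-labeled that sample, per-field accuracy gives you a number to put in the proposal. This is a minimal sketch: the function name and the exact-match comparison are assumptions, and real pipelines usually normalize dates and amounts before comparing.

```python
from collections import defaultdict


def field_accuracy(predictions: list[dict], labels: list[dict]) -> dict[str, float]:
    """Per-field share of sample documents where extraction matched the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, expected in zip(predictions, labels):
        for field, truth in expected.items():
            total[field] += 1
            if predicted.get(field) == truth:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Reporting accuracy per field rather than per document shows which fields are safe to automate now and which need review.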
Sell a review dashboard alongside the automation. Clients are more comfortable with automated extraction when they can see and approve the results before they hit the production system. The review step also generates the training data you’d need to fine-tune a model if accuracy isn’t good enough with off-the-shelf vision models.
The technology is ready. The main work is in understanding the client’s specific document types and building the validation and human-in-the-loop layer that makes the output trustworthy.