Which LLM is best for PDF extraction specifically?

Gemini 3 Pro leads on scientific PDFs with images thanks to native multimodal grounding. Claude Sonnet 4.6 is accurate within its 100-page PDF limit; longer documents need pre-chunking or Claude's batch API. For forms and invoices, Gemini 3 Flash plus a JSON-schema validator handles most SME volume at ~$0.003 per run (LM Council, 2026).

Can I use open-source models for extraction?

Yes, for cost-sensitive workloads. MiniMax, Qwen3-Max-Instruct, and DeepSeek-V3 hit 60–75% validity on simple schemas, which is serviceable but below the frontier tier. They shine on repetitive extraction at 3–10× lower cost than Gemini 3 Flash; the tradeoff is occasional schema drift that a validator catches.

What's the cheapest way to hit production-quality extraction?

Use Gemini 3 Flash as the first attempt ($0.50/$3 per million tokens), run a JSON-schema validator on every output, and cascade to Claude Sonnet 4.6 only on validation failure. This cheap-first cascade cuts cost 50–90% in production (Morph, 2026) while matching Claude-level accuracy on the hard 10–20% of rows.

Do I need fine-tuning for structured extraction?

Rarely. Prompt engineering plus a tight JSON schema gets you to 85%+ validity on most extraction tasks. Fine-tune only when you hit cost walls at scale or need <200ms latency on specialized domains. For the typical SME, prompt + schema + validator covers 95% of production needs.

How fast are LLM prices falling in 2026?

Anthropic dropped Opus pricing 67% and expanded context to 1M tokens in February 2026 (TLDL, 2026). Frontier-quality tiers have been falling 50–80% per year since 2023. Budget for quarterly re-evaluation: a model that's too expensive today may be the right default in 12 weeks.

What's the 369-field schema cliff?

ExtractBench (February 2026) found that every frontier model produces 0% valid output on a 369-field financial-reporting schema: Claude, GPT-5, and Gemini 3 Pro all fail. The fix is structural: break big schemas into smaller sub-extractions, then merge. Schema size predicts failure better than model choice.

Claude vs GPT-5 vs Gemini: 2026 Extraction Benchmark

Enterprise spend on LLM APIs jumped from $3.5 billion to $8.4 billion in six months, with Menlo Ventures projecting $15 billion by end of 2026 (Morph, 2026). A lot of that money is burning on the wrong model. Pick Claude Opus for a job Gemini Flash handles and you'll overspend 30×; pick Gemini Flash for a 369-field financial schema and you'll get 0% valid output (ExtractBench, 2026).

This is the head-to-head comparison of Claude Sonnet 4.6, GPT-5.4, and Gemini 3 Flash for structured data extraction, with real 2026 pricing, ExtractBench numbers, and the routing pattern that cuts cost 50–90% without sacrificing accuracy.

TL;DR

Gemini 3 Flash leads ExtractBench validity at 71% and pass rate at 6.9%, beating the larger Gemini 3 Pro on realistic schemas (ExtractBench, 2026).
Claude Sonnet 4.5/4.6 tie with Flash at 83% validity on research papers — accuracy ceiling for complex nested extraction.
GPT-5.4 has lowest validity (37%) but highest valid-only accuracy (80.4%) — useful on easy domains, risky on messy ones.
Pricing spread (Apr 2026): Gemini 3 Flash $0.50/$3, GPT-5.4 $2.50 input, Claude Sonnet 4.6 $3/$15 per million tokens — a 6× gap between the floor and the ceiling.
All models hit 0% valid output at 369-field schemas — schema design matters more than model choice.
Cheap-first cascade routing cuts production cost 50–90% (Morph, 2026) while preserving Claude-level accuracy.

What's the fair way to benchmark extraction in 2026?

Three axes decide which model wins: schema validity, field-level accuracy, and cost per 10K runs. Most published "LLM comparisons" use coding or reasoning benchmarks, and neither correlates with extraction performance. The right yardstick is ExtractBench (Feb 2026), which tests frontier models across five domains: research papers, credit agreements, wet-lab protocols, financial reports, and messy HTML.

ExtractBench's most useful finding isn't the leaderboard. It's the 369-field schema cliff: when schemas grow past ~300 required fields with deep nesting, every frontier model drops to 0% valid output. Field count × nesting depth predicts failure better than model choice. That single finding flips the procurement question from "which model is best?" to "how do I design schemas my model can actually hit?"

SO-Bench (Nov 2025) tracks structural compliance (whether JSON parses cleanly and matches the schema), which is usually a precondition for field accuracy. OmniAI's OCR benchmark handles the layout-sensitive end: receipts, forms, and rotated scans where multimodal grounding decides the outcome. Use ExtractBench for realistic mixed workloads, SO-Bench when structure is the constraint, and OmniAI when inputs are image-heavy.

For your own workload, build a 200-row golden set from real production inputs. Score three runs per model on validity rate, field-level F1, and cost, and you'll know more than any public benchmark can tell you. The transform layer pattern is where this evaluation belongs anyway.
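As a rough illustration of that scoring step, here is a minimal sketch that computes validity rate and field-level F1 for one run against a golden set. The function names, the field names, and the F1 convention (a wrong value counts as both a false positive and a false negative) are assumptions for the example, not something ExtractBench or the providers prescribe.

```python
import json

def parse_row(raw):
    """Return the parsed JSON object, or None if raw is not a JSON object."""
    try:
        row = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    return row if isinstance(row, dict) else None

def score_run(predictions, golden, required_fields):
    """Score one model run: returns (validity_rate, field_f1).

    predictions: raw JSON strings, one per golden row.
    golden: dicts holding the expected value for each required field.
    """
    valid = tp = fp = fn = 0
    for raw, expected in zip(predictions, golden):
        row = parse_row(raw)
        if row is not None and all(f in row for f in required_fields):
            valid += 1
        row = row or {}
        for field in required_fields:
            if field not in row:
                fn += 1  # expected field was never produced
            elif row[field] == expected.get(field):
                tp += 1
            else:
                fp += 1  # produced, but wrong...
                fn += 1  # ...so the true value was missed as well
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return valid / len(golden), f1

# Toy golden set with two rows; the second prediction is not valid JSON.
golden = [{"name": "Ada", "company": "Acme"}, {"name": "Bob", "company": "Beta"}]
predictions = ['{"name": "Ada", "company": "Acme"}', "not json"]
validity, f1 = score_run(predictions, golden, ["name", "company"])
print(f"validity={validity:.2f} field_f1={f1:.2f}")
```

Run it once per model and per repetition, then put the averages next to the per-run cost to get the three-axis comparison.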

Which model is most accurate on messy inputs?

On the ExtractBench research-paper domain, Gemini 3 Flash and Claude Sonnet 4.5 tie at 83% validity. Across all five domains, Gemini 3 Flash leads overall validity at 71% and pass rate at 6.9%, beating the larger Gemini 3 Pro on realistic schemas (ExtractBench, 2026). GPT-5 has the lowest validity at 37% but highest valid-only accuracy at 80.4%, meaning when it produces valid JSON, the fields are usually right, but it fails to produce valid JSON too often to rely on without a retry layer.

One important caveat: Claude Sonnet 4.5/4.6 was disqualified from ExtractBench's credit-agreements domain by a 100-page PDF input limit. Real-world long-PDF workloads need pre-chunking, Claude's batch API, or a different model for that slice. On the domains where Sonnet competed, it matched or beat Gemini 3 Flash on accuracy, but at 6× the cost.

For multimodal inputs (scientific papers with embedded figures, OCR'd forms, rotated scans), Gemini 3 Pro wins on native image grounding. Schema-strict mode (Gemini's response_schema, Claude's tool-use JSON, OpenAI's strict mode) lifts validity 10–20 percentage points over free-form JSON prompts. Always enable it.

Which model is cheapest per 10K extractions?

At Apr 2026 list prices, Gemini 3 Flash is the frontier-quality floor at $0.50 input / $3 output per million tokens. GPT-5.4 sits at $2.50 input. Claude Sonnet 4.6 is $3 input / $15 output, a 6× gap on input and 5× gap on output versus Gemini. Anthropic dropped Opus pricing 67% and expanded context to 1M tokens in February 2026 (TLDL, 2026), narrowing the top-tier gap.

Discounts change the math quickly. Both Anthropic and OpenAI offer 50% off on their batch APIs with 24-hour turnaround, ideal for overnight enrichment jobs. Anthropic's prompt caching gives a 90% discount on cache hits (Gemini supports caching too, with similar economics). For a workload that reuses the same system prompt across 10K calls, caching alone can cut the bill in half before routing enters the picture.
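To see how those discounts move a per-call bill, here is a small sketch using Claude Sonnet 4.6's list prices ($3/$15 per million tokens). The token counts (3,000 input tokens, of which 2,500 are a shared system prompt, and 500 output tokens) are assumptions picked for illustration, and the sketch ignores any surcharge for writing to the cache.

```python
def effective_cost(input_tokens, output_tokens, input_price, output_price,
                   cached_tokens=0, cache_discount=0.90, batch_discount=0.0):
    """USD cost of one call; prices are USD per million tokens.

    cached_tokens: input tokens served from the prompt cache (assumed).
    cache_discount: fraction taken off cached input tokens.
    batch_discount: fraction taken off the whole call when sent via a batch API.
    """
    fresh_tokens = input_tokens - cached_tokens
    billable_input = fresh_tokens + cached_tokens * (1 - cache_discount)
    cost = (billable_input * input_price + output_tokens * output_price) / 1_000_000
    return cost * (1 - batch_discount)

baseline = effective_cost(3_000, 500, 3.00, 15.00)
with_cache = effective_cost(3_000, 500, 3.00, 15.00, cached_tokens=2_500)
with_batch = effective_cost(3_000, 500, 3.00, 15.00, batch_discount=0.50)

for label, cost in [("baseline", baseline), ("cache", with_cache), ("batch", with_batch)]:
    print(f"{label:>8}: ${cost:.5f} per call, {1 - cost / baseline:.0%} saved")
```

How much caching saves depends on how much of each prompt is shared: output tokens are never discounted by the cache, so output-heavy extractions benefit less than this example suggests.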

On simple schemas, Gemini 3 Flash hits ~$0.003 per extraction run (LM Council, 2026). At SME volume, say 50K enrichments per month, that's $150/month total for frontier quality. The open-source tier (MiniMax, Qwen3-Max-Instruct, DeepSeek-V3) goes even cheaper, typically 3–10× below Gemini Flash, with accuracy that holds for simple extraction and wobbles on nested or OCR-heavy inputs.
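The per-run and monthly figures above can be reproduced with a few lines of arithmetic. This sketch assumes a simple extraction uses about 3,000 input tokens and 500 output tokens; those counts are my assumption (they happen to land on ~$0.003 at Flash prices), and the dictionary keys are labels for the example, not official model IDs.

```python
# USD per million tokens (input, output), taken from the list prices quoted above.
PRICES = {
    "gemini-3-flash": (0.50, 3.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def cost_per_run(model, input_tokens=3_000, output_tokens=500):
    """USD cost of a single extraction at the assumed token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def cost_for_volume(model, runs, **token_counts):
    """USD cost of `runs` extractions with the same token profile."""
    return runs * cost_per_run(model, **token_counts)

for model in PRICES:
    per_run = cost_per_run(model)
    print(f"{model}: ${per_run:.4f}/run, "
          f"${cost_for_volume(model, 10_000):.0f} per 10K, "
          f"${cost_for_volume(model, 50_000):.0f} per 50K")
```

Swap in token counts measured from your own prompts before trusting the totals; long source pages or verbose schemas shift the ratio between input and output cost.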

If you're orchestrating this through a workflow tool, n8n handles the routing layer at a fraction of Zapier's per-task cost.

When does premium pricing actually earn it back?

Premium models pay for themselves when a wrong extraction is expensive downstream. A mis-extracted lead title costs a few minutes of SDR time. A mis-extracted product price or regulatory filing costs real money. Break-even math for most workloads: when downstream cost of a bad row exceeds ~€40, Claude Sonnet's accuracy premium justifies itself over Gemini Flash at typical volumes.

Two edge schemas where Gemini Flash quietly fails and we now route around it: multi-row tables with merged cells (Flash hallucinates field alignment when a cell spans three rows), and OCR'd forms with rotated text (Flash mis-assigns fields when the source layout is tilted more than ~8 degrees). We caught both through schema-mismatch rates spiking to 12–18% on specific document types. The fix wasn't to drop Flash; it was to route these two schemas directly to Sonnet and keep Flash on everything else.
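One way to spot and route around such document types is to track the schema-mismatch rate per type and send anything above a threshold straight to the premium model. This is a sketch, not our production code: the document-type labels, the 10% threshold, and the model labels are hypothetical.

```python
from collections import Counter

def mismatch_rates(log):
    """log: iterable of (doc_type, passed_validation) pairs from past runs."""
    total, failed = Counter(), Counter()
    for doc_type, passed in log:
        total[doc_type] += 1
        if not passed:
            failed[doc_type] += 1
    return {doc_type: failed[doc_type] / total[doc_type] for doc_type in total}

def premium_doc_types(log, threshold=0.10):
    """Document types whose mismatch rate is at or above the (assumed) threshold."""
    return {t for t, rate in mismatch_rates(log).items() if rate >= threshold}

def first_model(doc_type, premium_types):
    """Pick the model that gets the first attempt for this document type."""
    return "claude-sonnet" if doc_type in premium_types else "gemini-flash"

# Toy log: invoices fail 5% of the time, merged-cell tables 15%.
log = ([("invoice", True)] * 95 + [("invoice", False)] * 5
       + [("merged_cell_table", True)] * 85 + [("merged_cell_table", False)] * 15)
premium = premium_doc_types(log)
print(first_model("invoice", premium), first_model("merged_cell_table", premium))
```

Recomputing the set on a schedule keeps the routing table honest as models and inputs change.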

Use Opus when you need the deepest reasoning across 1M tokens: long contracts, full codebases, multi-document cross-reference. Latency is 4–8 seconds, which rules it out of real-time use but is fine for overnight batch. Use Sonnet when accuracy matters and latency can tolerate 2–4 seconds. Use Flash when the schema is simple, the input is clean, and the volume is high, which is most SME extraction.

The rule I'd etch onto every extraction pipeline: don't default to the expensive model because it's "safer." Default to the cheapest model that validates, and escalate only when it doesn't. That's exactly how our playbook on qualifying scraped leads with LLMs is wired.

How do you route between models to cut extraction cost?

Cheap-first cascade: try Gemini Flash, validate the JSON output, escalate only failed rows to Claude Sonnet. This pattern cuts production cost 50–90% (Morph, 2026) while preserving frontier-model accuracy where it matters. Academic research on cascade strategies reports 98% cost reduction at GPT-4-equivalent quality on extraction tasks (LeanLM, 2026).

The Python skeleton is short enough to read in one pass:

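from pydantic import BaseModel, ValidationError

class Lead(BaseModel):
    name: str
    company: str
    role: str
    email: str

# gemini_flash and claude_sonnet are placeholder client wrappers, not real SDK
# objects: each .extract() call is assumed to return the model's raw JSON string.
def extract(row):
    # First attempt goes to the cheap model.
    raw = gemini_flash.extract(row, schema=Lead)
    try:
        return Lead.model_validate_json(raw)
    except ValidationError:
        # Escalate only on schema failure, passing the failed output as context.
        raw = claude_sonnet.extract(row, schema=Lead, context=raw)
        return Lead.model_validate_json(raw)  # retry loop omitted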

In n8n, the same pattern is three nodes: LLM (Gemini Flash) → IF (validator) → LLM (Claude Sonnet) on the false branch. Dedupe and log before the validator. Combined with batching and prompt caching, production systems see 47–80% total savings versus a Claude-only baseline (Morph, 2026).

One wrinkle worth calling out: don't escalate on every ambiguous field; escalate on schema failures only. A validator that catches malformed JSON and missing required fields flags the right subset. A validator that tries to judge "is this field content correct?" ends up escalating 40%+ of rows and erases the cost savings.
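The effect of the escalation rate on the bill is easy to model. In this sketch every row pays for the cheap call and escalated rows also pay for the premium call. The per-row costs ($0.003 and $0.0165) are assumptions for illustration, and retries, longer escalation prompts, and validator overhead are not modeled, so real savings will be lower than what it prints.

```python
def cascade_cost(cheap_cost, premium_cost, escalation_rate):
    """Expected USD per row: the cheap call always runs, the premium call only on escalation."""
    return cheap_cost + escalation_rate * premium_cost

def savings_vs_premium_only(cheap_cost, premium_cost, escalation_rate):
    """Fraction saved compared with sending every row to the premium model."""
    return 1 - cascade_cost(cheap_cost, premium_cost, escalation_rate) / premium_cost

CHEAP, PREMIUM = 0.003, 0.0165  # assumed per-row costs in USD

for rate in (0.10, 0.20, 0.40, 0.60):
    saved = savings_vs_premium_only(CHEAP, PREMIUM, rate)
    print(f"escalation {rate:.0%}: ${cascade_cost(CHEAP, PREMIUM, rate):.4f}/row, {saved:.0%} saved")
```

Every extra point of escalation costs a full premium call on top of the cheap one, which is why a validator that over-escalates gives the savings back so quickly.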

What about latency and throughput?

Gemini 3 Flash delivers p95 latency under 1 second; Claude Sonnet runs 2–4 seconds; Opus runs 4–8 seconds. For batch extraction, latency rarely matters: you're processing overnight, and the bottleneck is API rate limits, not per-request speed. For real-time enrichment during a chat or form submission, Flash is the only frontier model with acceptable latency.

Throughput rankings (approximate, from production workloads): Gemini 3 Flash ~150+ tokens/sec output, Sonnet ~80 t/s, Opus ~60 t/s. All three providers rate-limit by RPM and TPM per tier. For sustained high volume, Anthropic and OpenAI batch APIs run overnight at a 50% discount, which is the right choice for daily enrichment runs where same-day freshness is enough.

One practical note: batch APIs have queue latency that varies from 2 hours to the full 24. Don't build a workflow that promises "data ready by 9 AM" if you're queuing at 11 PM. The safer pattern is to batch at 6–8 PM and treat anything after 10 PM as next-business-day output.

How do you stop hallucinations in structured extraction?

JSON-schema validation plus a retry loop catches the vast majority of field-level errors. Without schema validation, expect 5–15% silent field drift in production: wrong dates, transposed digits, invented values that look plausible. Schema-strict mode at the provider level (Gemini response_schema, Claude tool-use, OpenAI strict mode) lifts baseline validity 10–20 percentage points and is the first setting to turn on.
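A validate-then-retry loop can be sketched with the standard library alone. Here `call_model` is a hypothetical callable standing in for whichever provider client you use, and the required fields are example names; in practice a Pydantic or Zod model would replace the hand-rolled check.

```python
import json

# Example schema: field name -> expected Python type (assumed for illustration).
REQUIRED = {"name": str, "company": str, "role": str, "email": str}

def validate(raw):
    """Return the parsed row if it is a JSON object with every required field of the right type, else None."""
    try:
        row = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(row, dict):
        return None
    for field, kind in REQUIRED.items():
        if not isinstance(row.get(field), kind):
            return None
    return row

def extract_with_retry(call_model, source, max_attempts=2):
    """call_model(source, previous_raw) -> raw JSON string (hypothetical signature).

    The previous failed output is passed back so the model can correct it.
    """
    previous = None
    for _ in range(max_attempts):
        raw = call_model(source, previous)
        row = validate(raw)
        if row is not None:
            return row
        previous = raw
    return None  # caller decides: escalate, queue for review, or drop

# Fake model for demonstration: fails once, then returns a complete row.
replies = iter([
    '{"name": "Ada"}',
    '{"name": "Ada", "company": "Acme", "role": "CTO", "email": "ada@example.com"}',
])
print(extract_with_retry(lambda source, previous: next(replies), "page text"))
```

Note that this only checks structure. It will not catch a plausible but wrong value, which is why the golden-set scoring still matters.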

Across SIÁN's 10K-page internal benchmark, schema-strict mode plus a Pydantic validator plus one retry caught 94% of the errors that would have otherwise landed in the CRM. The residual 6% was almost entirely ambiguous source data (the source page was inconsistent or the target field genuinely didn't exist), not model failure. That's the right error-rate frontier for most production workloads.

For the harder 369-field cliff, the fix is structural, not prompt-engineering. Break big schemas into smaller sub-extractions and merge. A 400-field invoice becomes five 80-field schemas (header, line-items, tax, payment-terms, metadata), each of which any frontier model handles cleanly. The merge step is deterministic code, not another LLM call.
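The split-and-merge step might look like the sketch below. `extract_group` is a hypothetical callable that wraps one LLM call for one sub-schema, and the group and field names are made up for the example; the point is that the merge is plain dictionary code with an explicit collision check.

```python
def extract_in_parts(document, groups, extract_group):
    """Run one sub-extraction per group and merge the results deterministically.

    groups: dict mapping a group name to the list of fields it owns.
    extract_group(document, group_name, fields) -> dict of extracted values
    (hypothetical signature).
    """
    merged = {}
    for group_name, fields in groups.items():
        part = extract_group(document, group_name, fields)
        unexpected = set(part) - set(fields)
        if unexpected:
            raise ValueError(f"{group_name} returned fields it does not own: {sorted(unexpected)}")
        overlap = merged.keys() & part.keys()
        if overlap:
            raise ValueError(f"fields returned by more than one group: {sorted(overlap)}")
        merged.update(part)
    return merged

# Example groups for an invoice; a real schema would have far more fields per group.
GROUPS = {
    "header": ["invoice_number", "issue_date"],
    "tax": ["vat_rate", "vat_amount"],
    "payment_terms": ["due_date"],
}

def fake_extract_group(document, group_name, fields):
    # Stand-in for an LLM call: returns a dummy value for every requested field.
    return {field: f"<{field}>" for field in fields}

print(extract_in_parts("invoice text", GROUPS, fake_extract_group))
```

Grouping fields by meaning rather than by arbitrary chunks keeps related context together in each prompt, and the collision checks make a mis-specified split fail loudly instead of silently overwriting values.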

Add an "other" catchall bucket to your schema for fields the model isn't sure about. It's the pressure-release valve that keeps validators from rejecting otherwise-correct rows over one ambiguous field. Pair validation with modern pipeline hygiene (idempotent retries, source-URL logging, run-level audit trails) and you have an extraction layer you can actually trust to run unsupervised.

The 2026 extraction stack, in five bullets

Cost floor: Gemini 3 Flash at $0.50/$3 per million tokens, ~$0.003 per simple extraction run.
Accuracy ceiling: Claude Sonnet 4.6 (or Opus 4.1 for long-context, 1M tokens).
Routing pattern: Cheap-first cascade — Gemini Flash → validator → escalate to Sonnet on failure. 50–90% cost savings.
Schema discipline: Break anything over 150 fields into sub-extractions. Frontier models fall off a cliff around 369 fields.
Validation: Always. Schema-strict mode + Pydantic/Zod + one retry catches 94% of production errors.

Ready to ship this pattern? Book a SIÁN extraction-scoping call and we'll map your workload to the right model mix. If you're earlier in the stack, start with our LLM-powered scraping stack guide and our enterprise-volume extraction playbook.
