Document AI
Classifying the geographic scope of papers with structured outputs
Problem
I had a folder of academic papers and a seemingly simple question: which countries does each paper study? The hard part was not calling a model; it was avoiding false inference. A country should not appear because of author affiliation, journal location, or institution. It should only appear if explicitly named as the setting, population, data source, policy, or analysis focus.
Decision
I built it as a two-step pipeline: PDF to text with logs, then text to JSONL using structured outputs against a Pydantic schema. The schema forces one of two states: specific_countries with a country list, or region_or_global with a geographic scope.
That constraint matters more than the prompt: if the model returns countries alongside a regional scope, or declares specific countries without a list, validation fails.
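A minimal sketch of what such a two-state schema could look like, assuming Pydantic v2. The class name GeoScope and the field names scope_type, countries, and geographic_scope are hypothetical; the post only names the two states, not the actual fields.

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class GeoScope(BaseModel):
    # Hypothetical field names; only the two state names come from the post.
    scope_type: Literal["specific_countries", "region_or_global"]
    countries: List[str] = Field(default_factory=list)
    geographic_scope: Optional[str] = None

    @model_validator(mode="after")
    def check_exclusive_states(self) -> "GeoScope":
        if self.scope_type == "specific_countries":
            # Declaring specific countries without a list is invalid.
            if not self.countries:
                raise ValueError("specific_countries requires a country list")
            if self.geographic_scope is not None:
                raise ValueError("specific_countries must not carry a regional scope")
        else:
            # Countries alongside a regional or global scope is invalid.
            if self.countries:
                raise ValueError("region_or_global must not list countries")
            if not self.geographic_scope:
                raise ValueError("region_or_global requires a geographic scope")
        return self


def is_valid(payload: dict) -> bool:
    """Return True when the payload satisfies the two-state schema."""
    try:
        GeoScope.model_validate(payload)
        return True
    except ValidationError:
        return False
```

With this shape, a payload such as {"scope_type": "region_or_global", "countries": ["Peru"], "geographic_scope": "Andes"} is rejected at validation time rather than silently written to the results file.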
Tradeoffs
- No OCR: scanned PDFs are flagged with low-extraction warnings instead of being forced through to unreliable results.
- Long texts: when a document exceeds the configured limit, the pipeline sends the beginning and end rather than the entire PDF.
- Simple resume: results.jsonl is append-only and the script skips IDs already processed.
- Best-effort normalization: aliases plus pycountry clean variants such as US/USA/United States.
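The tradeoffs above can be sketched with a few small helpers. This is an illustration under stated assumptions, not the project's code: the function names, the "id" key in each JSONL row, the character-based limit, and the alias table are all hypothetical, and the pycountry lookup is left out so the sketch runs on the standard library alone.

```python
import json
from pathlib import Path


def head_and_tail(text: str, max_chars: int) -> str:
    """Keep the beginning and end of an over-long text, dropping the middle."""
    if len(text) <= max_chars:
        return text
    marker = "\n[...]\n"
    half = max(0, (max_chars - len(marker)) // 2)
    # Slice the tail from an explicit index: text[-0:] would return everything.
    return text[:half] + marker + text[len(text) - half:]


def processed_ids(results_path: Path) -> set:
    """Collect the IDs already present in the append-only JSONL file."""
    done = set()
    if not results_path.exists():
        return done
    with results_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])  # "id" key is an assumption
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # e.g. a truncated last line from an interrupted run
    return done


def append_result(results_path: Path, row: dict) -> None:
    """Append one result row; existing rows are never rewritten."""
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")


# Hypothetical alias table; a real run would fall back to pycountry after this.
ALIASES = {"us": "United States", "usa": "United States", "united states": "United States"}


def normalize_country(name: str) -> str:
    """Map known aliases to one canonical name; leave unknown names as given."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.casefold(), cleaned)
```

A resume loop would then call processed_ids once at startup and skip any paper whose ID is already in the set before sending text to the model.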
Validation
Beyond countries and the main conclusion, the extractor can return an explicit Latin America finding with a literal quote. The pipeline then tries to verify that the quote is a substring of the extracted text.
In the local 70-paper run, the system produced 70 JSONL rows: 55 with specific countries and 15 with regional or global scope. It also found 12 rows with LATAM countries and 17 with a Latin America block. Literal quote verification landed at zero matches, which points to a concrete improvement: compare against normalized text or fuzzy windows, because PDF extraction changes spaces, line breaks, and hyphenation.
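One way the normalized comparison could look, as a sketch rather than the pipeline's implementation. The function names are hypothetical, and only the normalized-substring variant is shown; fuzzy windows are not covered here.

```python
import re
import unicodedata


def normalize_for_match(text: str) -> str:
    """Reduce text to a form that survives typical PDF extraction noise."""
    text = unicodedata.normalize("NFKC", text)  # e.g. folds ligatures
    text = text.replace("\u00ad", "")  # drop soft hyphens
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every run of whitespace, including line breaks, to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().casefold()


def quote_in_text(quote: str, extracted: str) -> bool:
    """Check whether the quote appears in the extracted text after normalization."""
    needle = normalize_for_match(quote)
    # An empty needle would trivially match, so reject it explicitly.
    return bool(needle) and needle in normalize_for_match(extracted)
```

One known limitation of this approach: rejoining at line breaks also removes hyphens that belong to a genuine compound word, so a quote containing such a compound can still fail to match.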
Outcome
- Reproducible pipeline: pdf/ → out/text/ → out/jsonl/results.jsonl.
- Schema validators prevent mixing explicit countries with regional/global scopes.
- Streamlit viewer for exploring results, errors, and local PDFs.
- A clear next-improvement list: optional OCR, normalized quote verification, and manual review for ambiguous cases.
Next
The next step is not adding more fields. It is hardening evaluation: fixtures with difficult examples, quote normalization, and a manually reviewed sample to measure precision before processing larger corpora.
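As a rough sketch of how precision on a reviewed sample might be measured, assuming predictions and manual labels are both available as mappings from paper ID to country names. The function name and data layout are hypothetical, and the example values in the usage note are made up for illustration.

```python
from typing import Dict, Iterable, Optional


def country_precision(
    predicted: Dict[str, Iterable[str]],
    reviewed: Dict[str, Iterable[str]],
) -> Optional[float]:
    """Micro-averaged precision over the papers in the reviewed sample.

    Returns None when the model predicted no countries for those papers,
    since precision is undefined in that case.
    """
    true_positives = 0
    total_predicted = 0
    for paper_id, gold in reviewed.items():
        pred = set(predicted.get(paper_id, []))
        true_positives += len(pred & set(gold))
        total_predicted += len(pred)
    if total_predicted == 0:
        return None
    return true_positives / total_predicted
```

For instance, if a paper about Brazil is also tagged with United States because of an author affiliation, that extra country counts against precision, which is exactly the false inference the pipeline is meant to avoid.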