Classifying the geographic scope of papers with structured outputs

Tags: openai, structured outputs, pydantic, papers, pdf

Problem

I had a folder of academic papers and a seemingly simple question: which countries does each paper study? The hard part was not calling a model; it was avoiding false inference. A country should not appear because of author affiliation, journal location, or institution. It should only appear if explicitly named as the setting, population, data source, policy, or analysis focus.

Decision

I built it as a two-step pipeline: PDF to text with logs, then text to JSONL using structured outputs against a Pydantic schema. The schema forces one of two states: specific_countries with a country list, or region_or_global with a geographic scope.

That constraint matters more than the prompt: if the model returns countries alongside a regional scope, or declares specific countries without a list, validation fails.
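A minimal sketch of that two-state constraint, assuming Pydantic v2. Only the state names specific_countries and region_or_global come from the pipeline described here; the class name, the field names (scope_type, countries, geographic_scope), and the is_valid helper are hypothetical and chosen for illustration.

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class GeoScope(BaseModel):
    # Field names are illustrative assumptions, not the original schema.
    scope_type: Literal["specific_countries", "region_or_global"]
    countries: List[str] = Field(default_factory=list)
    geographic_scope: Optional[str] = None

    @model_validator(mode="after")
    def check_exclusive_states(self) -> "GeoScope":
        if self.scope_type == "specific_countries":
            # Declaring specific countries without a list is invalid.
            if not self.countries:
                raise ValueError("specific_countries requires a country list")
            if self.geographic_scope is not None:
                raise ValueError("specific_countries must not set geographic_scope")
        else:
            # Countries alongside a regional or global scope are invalid.
            if self.countries:
                raise ValueError("region_or_global must not list countries")
            if not self.geographic_scope:
                raise ValueError("region_or_global requires geographic_scope")
        return self


def is_valid(payload: dict) -> bool:
    """Return True if the payload satisfies the two-state schema."""
    try:
        GeoScope.model_validate(payload)
        return True
    except ValidationError:
        return False
```

With this shape, a payload that lists Chile while also declaring a Latin America scope is rejected at validation time, regardless of how the prompt was worded.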

Tradeoffs

  • No OCR: scanned PDFs are flagged with a low-extraction warning rather than pushed through to produce unreliable results.
  • Long texts: when a document exceeds the configured limit, the pipeline sends the beginning and end rather than the entire PDF.
  • Simple resume: results.jsonl is append-only and the script skips IDs already processed.
  • Best-effort normalization: an alias table plus pycountry collapses variants such as US/USA/United States into a single name.
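
Three of these tradeoffs are small enough to sketch with the standard library. This is an illustration under assumptions, not the pipeline's actual code: the function names, the "[...]" marker, the "id" key in each JSONL row, and the alias table are all hypothetical, and the pycountry lookup is left out.

```python
import json
from pathlib import Path


def head_and_tail(text: str, max_chars: int) -> str:
    """Keep the beginning and end of a text that exceeds max_chars."""
    if len(text) <= max_chars:
        return text
    marker = "\n[...]\n"  # assumed separator marking the omitted middle
    half = max(0, (max_chars - len(marker)) // 2)
    return text[:half] + marker + text[len(text) - half:]


def processed_ids(results_path: Path) -> set:
    """Collect IDs already present in the append-only JSONL file."""
    if not results_path.exists():
        return set()
    ids = set()
    with results_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                ids.add(json.loads(line)["id"])  # "id" key is an assumption
    return ids


def append_result(results_path: Path, row: dict) -> None:
    """Append one result row; existing rows are never rewritten."""
    with results_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")


# Hypothetical alias table; a fuller version would fall back to pycountry.
ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
}


def normalize_country(name: str) -> str:
    """Map known aliases to one canonical name; pass others through."""
    key = " ".join(name.split()).lower()
    return ALIASES.get(key, name.strip())
```

A resumed run would call processed_ids once at startup and skip any paper whose ID is already in the set, so an interrupted batch can restart without repeating model calls.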

Validation

Beyond countries and the main conclusion, the extractor can return an explicit Latin America finding with a literal quote. The pipeline then tries to verify that the quote is a substring of the extracted text.

In the local 70-paper run, the system produced 70 JSONL rows: 55 with specific countries and 15 with regional or global scope. It also found 12 rows with LATAM countries and 17 with a Latin America block. None of the quotes passed literal verification. That points to a concrete improvement: compare against normalized text or fuzzy windows, because PDF extraction changes spaces, line breaks, and hyphenation.
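
One way the normalized comparison could look, using only the standard library. This is a sketch of the proposed improvement, not something the pipeline already does; the function names are hypothetical, and fuzzy windows are not covered.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Reduce extraction noise before comparing a quote with the text."""
    text = unicodedata.normalize("NFKC", text)   # e.g. ligatures to plain letters
    text = text.replace("\u00ad", "")            # drop soft hyphens
    text = re.sub(r"-\s*\n\s*", "", text)        # rejoin words split across lines
    text = re.sub(r"\s+", " ", text)             # collapse spaces and line breaks
    return text.strip().casefold()


def quote_in_text(quote: str, extracted: str) -> bool:
    """True if the normalized quote occurs in the normalized text."""
    q = normalize(quote)
    return bool(q) and q in normalize(extracted)
```

One known limitation of this sketch: rejoining at line breaks also merges genuine compounds such as "low-income" when they happen to wrap, so a fuzzy fallback would still be useful for those cases.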

Outcome

  • Reproducible pipeline: pdf/ → out/text/ → out/jsonl/results.jsonl.
  • Schema validators prevent mixing explicit countries with regional/global scopes.
  • Streamlit viewer for exploring results, errors, and local PDFs.
  • A clear next-improvement list: optional OCR, normalized quote verification, and manual review for ambiguous cases.

Next

The next step is not adding more fields. It is hardening evaluation: fixtures with difficult examples, quote normalization, and a manually reviewed sample to measure precision before processing larger corpora.
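
For the manually reviewed sample, precision over country labels can be computed in a few lines. The sketch below assumes both the predictions and the reviewed labels are available as a mapping from paper ID to a set of country names; the function name and that data shape are hypothetical.

```python
def micro_precision(predicted: dict, reviewed: dict) -> float:
    """Share of predicted countries that the manual review confirms.

    Only papers present in the reviewed sample are scored.
    """
    true_pos = 0
    false_pos = 0
    for paper_id, gold in reviewed.items():
        pred = predicted.get(paper_id, set())
        true_pos += len(pred & gold)   # predicted and confirmed
        false_pos += len(pred - gold)  # predicted but not confirmed
    total = true_pos + false_pos
    return true_pos / total if total else 0.0
```

A false positive here is exactly the failure described at the start: a country that appears in the output because of an affiliation or journal location rather than the study itself.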