Architecture diagram: PDF corpus → schema induction → parallel LLM workers → structured JSONL output


Multiagent document taxonomist

General-purpose cloud-native platform for extracting structured data from unstructured PDFs. Induces an extraction schema from example documents, then processes batches through event-driven LLM workers at scale.

Stack

Python · FastAPI · GCP (Cloud Run, Cloud Functions, Firestore) · Gemini · OpenAI · Pydantic · React · Vite

Artifacts

Public demo / repo

TL;DR

  • Schema induction via LLM agent: proposes and refines an extraction schema from example documents automatically.
  • Cloud-native architecture: FastAPI (Cloud Run) + event-driven workers (Cloud Functions Gen2) + Firestore state.
  • Idempotent by design: per-file transaction guards, exponential backoff, dual LLM fallback (Gemini → OpenAI).
  • Domain-agnostic: apply to any document corpus (legal, financial, medical) by inducing a new schema.
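The per-file transaction guard mentioned above boils down to an atomic claim-once check. A minimal runnable sketch, with a locked in-memory dict standing in for the Firestore transaction (all names here — `claim_file`, `mark_done`, `_status` — are illustrative, not the platform's actual API):

```python
import threading

# In production this guard is a Firestore transaction keyed on the file's
# document ID; a locked dict stands in here so the pattern is runnable.
_status: dict[str, str] = {}
_lock = threading.Lock()

def claim_file(file_id: str) -> bool:
    """Atomically claim a file for processing.

    Returns True exactly once per file_id, no matter how many times the
    upload event is redelivered or how many workers race on it.
    """
    with _lock:
        if _status.get(file_id) in ("processing", "done"):
            return False  # already claimed: a retry or duplicate event is a no-op
        _status[file_id] = "processing"
        return True

def mark_done(file_id: str) -> None:
    """Record completion so replays of finished files stay no-ops."""
    with _lock:
        _status[file_id] = "done"
```

The same compare-and-set shape is what makes retries and replays safe: every duplicate trigger hits the `False` branch and exits without side effects.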

Reusable patterns

  • Schema induction via LLM agents: auto-designs extraction templates from example documents — no manual field mapping.
  • Idempotent event-driven workers: Firestore per-file locks prevent duplicate processing under retries or concurrent triggers.
  • Multi-provider LLM fallback: primary (Gemini) fails → automatic fallback (OpenAI) via a shared adapter interface.
  • Map/reduce over PDF chunks: parallel LLM extraction per chunk, reduce aggregates with source provenance.
  • Auto-generated SDKs from OpenAPI spec: typed JS and Python clients — consumers never import backend internals.

Context

Unstructured PDFs are everywhere — legal filings, financial reports, procurement documents — but extracting structured data from them typically requires writing a bespoke parser for each format.

Goal: build a platform where the extraction schema is induced automatically from examples, and the processing pipeline scales to thousands of documents without manual intervention.

The platform is domain-agnostic: it has been applied to judicial appeal documents (TARCJA) and is designed to work on any document corpus with minimal configuration.

Decisions

  • Schema induction via LLM agents: a SchemaDesigner agent proposes field names, types, rules, and synonyms by analyzing a sample of documents — instead of requiring manual schema definition.
  • Cloud Functions Gen2 triggered by GCS object events (Eventarc): each uploaded PDF spawns an independent worker, enabling natural parallelism without an explicit queue.
  • Firestore for idempotent state: per-file transaction records guard against reprocessing — safe to retry, replay, or resume any job.
  • Dual LLM fallback (Gemini → OpenAI): provider failures fall through automatically via a shared adapter interface — the provider order is swappable via a single environment variable.
  • Monorepo with shared core package: backend and worker both install `multiagent-core` (editable) — no logic duplication, no cross-app imports.
  • Auto-generated SDKs from OpenAPI spec: typed JS and Python clients allow consumers (frontend, CLI scripts) to interact without importing backend internals.
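The SchemaDesigner's output — field names, types, rules, and synonyms — implies a schema shape along these lines. The platform's real models are Pydantic; this sketch uses stdlib dataclasses to stay dependency-free, and every field name here is an assumption for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class InducedField:
    name: str                                         # e.g. "filing_date"
    type: str                                         # "string" | "date" | "number" | ...
    rules: list[str] = field(default_factory=list)    # validation hints the agent proposed
    synonyms: list[str] = field(default_factory=list) # surface forms seen in sample docs

@dataclass
class InducedSchema:
    domain: str
    fields: list[InducedField]

    def field_names(self) -> list[str]:
        return [f.name for f in self.fields]

# Hypothetical induced schema for the judicial-appeals corpus.
schema = InducedSchema(
    domain="judicial_appeals",
    fields=[
        InducedField("case_number", "string", rules=["required"],
                     synonyms=["docket no.", "cause no."]),
        InducedField("filing_date", "date"),
    ],
)
```

Refinement then becomes an agent loop over this structure: the designer proposes a schema, extraction is run on the samples, and gaps feed back as edits to `fields`.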

Architecture

  • FastAPI (Cloud Run) handles job orchestration, signed GCS upload URLs, schema induction endpoints, and health checks.
  • Cloud Function Gen2 triggers on GCS finalization events (Eventarc) — one invocation per uploaded PDF.
  • Firestore tracks per-file status atomically, preventing duplicate processing under concurrent retries.
  • Shared core package (multiagent-core): domain models, LLM adapters, chunking logic, storage services — installed in both backend and worker.
  • React frontend (Vite): schema inducer UI, document uploader, job status monitor, and results viewer — communicates exclusively via the generated JS SDK.
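The map/reduce step inside each worker can be sketched as: extract fields from every chunk in parallel, then merge while recording which chunk each value came from. `extract_chunk` below is a string-matching stand-in for the real per-chunk LLM call, and the merge policy (first seen wins) is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_chunk(chunk_id: int, text: str) -> dict:
    """Stand-in for the per-chunk LLM extraction call."""
    fields = {}
    if "Case No." in text:
        fields["case_number"] = text.split("Case No.")[1].strip()
    return {"chunk": chunk_id, "fields": fields}

def map_reduce(chunks: list[str]) -> dict:
    # Map: one extraction per chunk, run in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda p: extract_chunk(*p), enumerate(chunks)))
    # Reduce: merge fields, keeping source provenance per value.
    merged: dict[str, dict] = {}
    for r in results:
        for name, value in r["fields"].items():
            merged.setdefault(name, {"value": value, "source_chunk": r["chunk"]})
    return merged
```

Keeping `source_chunk` alongside each value is what lets the results viewer link every extracted field back to the page region it came from.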

Outcome

  • Applied to 5200+ judicial appeal documents (TARCJA project): 95–100% coverage on core fields, 16.8 s average latency per document.
  • Schema induction reduces domain setup from hours of manual field mapping to minutes of agent-guided refinement.
  • Idempotent processing: any failed or partial job can be retried or replayed without risk of duplicate records.
  • Architecture scales horizontally: each PDF is an independent Cloud Function invocation — throughput grows with GCS upload rate.