Architecture diagram: PDF corpus → schema induction → parallel LLM workers → structured JSONL output


Multiagent document taxonomist

General-purpose cloud-native platform for extracting structured data from unstructured PDFs. Induces an extraction schema from example documents, then processes batches through event-driven LLM workers at scale.

Stack

Python · FastAPI · GCP (Cloud Run, Cloud Functions, Firestore) · Gemini · OpenAI · Pydantic · React · Vite

Artifacts

Public demo / repo

TL;DR

  • Schema induction via LLM agent: proposes and refines an extraction schema from example documents automatically.
  • Cloud-native architecture: FastAPI (Cloud Run) + event-driven workers (Cloud Functions Gen2) + Firestore state.
  • Idempotent by design: per-file transaction guards, exponential backoff, dual LLM fallback (Gemini → OpenAI).
  • Domain-agnostic: apply to any document corpus (legal, financial, medical) by inducing a new schema.
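The per-file transaction guard mentioned above boils down to an atomic claim-once check. A minimal runnable sketch, with a locked in-memory dict standing in for the Firestore transaction (all names here — `claim_file`, `mark_done`, `_status` — are illustrative, not the platform's actual API):

```python
import threading

# In production this guard is a Firestore transaction keyed on the file's
# document ID; a locked dict stands in here so the pattern is runnable.
_status: dict[str, str] = {}
_lock = threading.Lock()

def claim_file(file_id: str) -> bool:
    """Atomically claim a file for processing.

    Returns True exactly once per file_id, no matter how many times the
    upload event is redelivered or how many workers race on it.
    """
    with _lock:
        if _status.get(file_id) in ("processing", "done"):
            return False  # already claimed: a retry or duplicate event is a no-op
        _status[file_id] = "processing"
        return True

def mark_done(file_id: str) -> None:
    """Record completion so replays of finished files stay no-ops."""
    with _lock:
        _status[file_id] = "done"
```

The same compare-and-set shape is what makes retries and replays safe: every duplicate trigger hits the `False` branch and exits without side effects.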

Reusable patterns

  • Schema induction via LLM agents: auto-designs extraction templates from example documents — no manual field mapping.
  • Idempotent event-driven workers: Firestore per-file locks prevent duplicate processing under retries or concurrent triggers.
  • Multi-provider LLM fallback: primary (Gemini) fails → automatic fallback (OpenAI) via a shared adapter interface.
  • Map/reduce over PDF chunks: parallel LLM extraction per chunk, reduce aggregates with source provenance.
  • Auto-generated SDKs from OpenAPI spec: typed JS and Python clients — consumers never import backend internals.

Context

Unstructured PDFs are everywhere — legal filings, financial reports, procurement documents — but extracting structured data from them typically requires writing a bespoke parser for each format.

Goal: build a platform where the extraction schema is induced automatically from examples, and the processing pipeline scales to thousands of documents without manual intervention.

The platform is domain-agnostic: it has been applied to judicial appeal documents (TARCJA) and is designed to work on any document corpus with minimal configuration.

Decisions

  • Schema induction via LLM agents: a SchemaDesigner agent proposes field names, types, rules, and synonyms by analyzing a sample of documents — instead of requiring manual schema definition.
  • Cloud Functions Gen2 triggered by GCS object events (Eventarc): each uploaded PDF spawns an independent worker, enabling natural parallelism without an explicit queue.
  • Firestore for idempotent state: per-file transaction records guard against reprocessing — safe to retry, replay, or resume any job.
  • Dual LLM fallback (Gemini → OpenAI): provider failures fall through automatically via a shared adapter interface — the provider order is swappable via a single environment variable.
  • Monorepo with shared core package: backend and worker both install `multiagent-core` (editable) — no logic duplication, no cross-app imports.
  • Auto-generated SDKs from OpenAPI spec: typed JS and Python clients allow consumers (frontend, CLI scripts) to interact without importing backend internals.
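The SchemaDesigner's output — field names, types, rules, and synonyms — implies a schema shape along these lines. The platform's real models are Pydantic; this sketch uses stdlib dataclasses to stay dependency-free, and every field name here is an assumption for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class InducedField:
    name: str                                         # e.g. "filing_date"
    type: str                                         # "string" | "date" | "number" | ...
    rules: list[str] = field(default_factory=list)    # validation hints the agent proposed
    synonyms: list[str] = field(default_factory=list) # surface forms seen in sample docs

@dataclass
class InducedSchema:
    domain: str
    fields: list[InducedField]

    def field_names(self) -> list[str]:
        return [f.name for f in self.fields]

# Hypothetical induced schema for the judicial-appeals corpus.
schema = InducedSchema(
    domain="judicial_appeals",
    fields=[
        InducedField("case_number", "string", rules=["required"],
                     synonyms=["docket no.", "cause no."]),
        InducedField("filing_date", "date"),
    ],
)
```

Refinement then becomes an agent loop over this structure: the designer proposes a schema, extraction is run on the samples, and gaps feed back as edits to `fields`.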

Architecture

  • FastAPI (Cloud Run) handles job orchestration, signed GCS upload URLs, schema induction endpoints, and health checks.
  • Cloud Function Gen2 triggers on GCS finalization events (Eventarc) — one invocation per uploaded PDF.
  • Firestore tracks per-file status atomically, preventing duplicate processing under concurrent retries.
  • Shared core package (multiagent-core): domain models, LLM adapters, chunking logic, storage services — installed in both backend and worker.
  • React frontend (Vite): schema inducer UI, document uploader, job status monitor, and results viewer — communicates exclusively via the generated JS SDK.
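The map/reduce step inside each worker can be sketched as: extract fields from every chunk in parallel, then merge while recording which chunk each value came from. `extract_chunk` below is a string-matching stand-in for the real per-chunk LLM call, and the merge policy (first seen wins) is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_chunk(chunk_id: int, text: str) -> dict:
    """Stand-in for the per-chunk LLM extraction call."""
    fields = {}
    if "Case No." in text:
        fields["case_number"] = text.split("Case No.")[1].strip()
    return {"chunk": chunk_id, "fields": fields}

def map_reduce(chunks: list[str]) -> dict:
    # Map: one extraction per chunk, run in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda p: extract_chunk(*p), enumerate(chunks)))
    # Reduce: merge fields, keeping source provenance per value.
    merged: dict[str, dict] = {}
    for r in results:
        for name, value in r["fields"].items():
            merged.setdefault(name, {"value": value, "source_chunk": r["chunk"]})
    return merged
```

Keeping `source_chunk` alongside each value is what lets the results viewer link every extracted field back to the page region it came from.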

Outcome

  • Applied to 5200+ judicial appeal documents (TARCJA project): 95–100% coverage on core fields, 16.8 s average latency per document.
  • Schema induction reduces domain setup from hours of manual field mapping to minutes of agent-guided refinement.
  • Idempotent processing: any failed or partial job can be retried or replayed without risk of duplicate records.
  • Architecture scales horizontally: each PDF is an independent Cloud Function invocation — throughput grows with GCS upload rate.