Multiagent document taxonomist
General-purpose cloud-native platform for extracting structured data from unstructured PDFs. Induces an extraction schema from example documents, then processes batches through event-driven LLM workers at scale.
TL;DR
- Schema induction via LLM agent: automatically proposes and refines an extraction schema from example documents.
- Cloud-native architecture: FastAPI (Cloud Run) + event-driven workers (Cloud Functions Gen2) + Firestore state.
- Idempotent by design: per-file transaction guards, exponential backoff, dual LLM fallback (Gemini → OpenAI).
- Domain-agnostic: apply to any document corpus (legal, financial, medical) by inducing a new schema.
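The dual-fallback point above can be sketched as a shared adapter interface tried in order. The class and method names (`LLMAdapter`, `FallbackLLM`, `complete`) are illustrative, not the platform's actual API, and stub providers stand in for the real Gemini/OpenAI adapters so the sketch stays runnable:

```python
from typing import Protocol


class LLMAdapter(Protocol):
    """Shared interface both providers implement (name illustrative)."""
    def complete(self, prompt: str) -> str: ...


class FallbackLLM:
    """Try adapters in order; the first success wins.

    In the platform the list would be roughly [GeminiAdapter(), OpenAIAdapter()],
    with the order controlled by an env variable; stubs keep this runnable.
    """

    def __init__(self, adapters: list[LLMAdapter]):
        self.adapters = adapters

    def complete(self, prompt: str) -> str:
        last_error = None
        for adapter in self.adapters:
            try:
                return adapter.complete(prompt)
            except Exception as exc:  # provider outage, quota, timeout
                last_error = exc
        raise RuntimeError("all LLM providers failed") from last_error


# Stub providers standing in for the real adapters.
class FlakyPrimary:
    def complete(self, prompt: str) -> str:
        raise TimeoutError("primary provider unavailable")


class WorkingFallback:
    def complete(self, prompt: str) -> str:
        return f"extracted: {prompt}"
```

Because callers only see `complete()`, swapping or reordering providers never touches worker code.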
Reusable patterns
- Schema induction via LLM agents: auto-designs extraction templates from example documents — no manual field mapping.
- Idempotent event-driven workers: Firestore per-file locks prevent duplicate processing under retries or concurrent triggers.
- Multi-provider LLM fallback: primary (Gemini) fails → automatic fallback (OpenAI) via a shared adapter interface.
- Map/reduce over PDF chunks: parallel LLM extraction per chunk, reduce aggregates with source provenance.
- Auto-generated SDKs from OpenAPI spec: typed JS and Python clients — consumers never import backend internals.
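The map/reduce pattern above can be sketched in a few lines. This is a simplified stand-in, not the platform's code: `extract_chunk` replaces the real per-chunk LLM call with naive `key: value` parsing, and the reduce step resolves conflicts by letting the earliest chunk win (a simplifying assumption):

```python
from concurrent.futures import ThreadPoolExecutor


def extract_chunk(chunk_id: int, text: str) -> dict:
    """Stub for the per-chunk LLM extraction call.

    In the platform this is one LLM request per PDF chunk; here we just
    parse "key: value" lines so the sketch runs without a provider.
    """
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return {"chunk_id": chunk_id, "fields": fields}


def map_reduce_extract(chunks: list[str]) -> dict:
    """Map: extract each chunk in parallel. Reduce: merge fields,
    recording which chunk each value came from (source provenance)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda p: extract_chunk(*p), enumerate(chunks)))
    merged: dict = {}
    for result in sorted(results, key=lambda r: r["chunk_id"]):
        for key, value in result["fields"].items():
            # setdefault => first (lowest-numbered) chunk wins on conflict
            merged.setdefault(key, {"value": value,
                                    "source_chunk": result["chunk_id"]})
    return merged
```

Keeping `source_chunk` alongside each value is what lets the results viewer point back to where in the PDF a field was found.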
Context
Unstructured PDFs are everywhere — legal filings, financial reports, procurement documents — but extracting structured data from them requires writing a parser for each format.
Goal: build a platform where the extraction schema is induced automatically from examples, and the processing pipeline scales to thousands of documents without manual intervention.
The platform is domain-agnostic: it has been applied to judicial appeal documents (TARCJA) and is designed to work on any document corpus with minimal configuration.
Decisions
- Schema induction via LLM agents: a SchemaDesigner agent proposes field names, types, rules, and synonyms by analyzing a sample of documents — instead of requiring manual schema definition.
- Cloud Functions Gen2 triggered by GCS object events (Eventarc): each uploaded PDF spawns an independent worker, enabling natural parallelism without an explicit queue.
- Firestore for idempotent state: per-file transaction records guard against reprocessing — safe to retry, replay, or resume any job.
- Dual LLM fallback (Gemini → OpenAI): provider failures fall through automatically via a shared adapter interface — swappable with a single env variable.
- Monorepo with shared core package: backend and worker both install `multiagent-core` (editable) — no logic duplication, no cross-app imports.
- Auto-generated SDKs from OpenAPI spec: typed JS and Python clients allow consumers (frontend, CLI scripts) to interact without importing backend internals.
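The SchemaDesigner decision above can be illustrated with a drastically simplified induction loop. Everything here is hypothetical: the real agent proposes field names, types, rules, and synonyms via LLM calls, while this sketch reduces "refinement" to merging per-document proposals from a stub:

```python
def induce_schema(examples: list[str], propose) -> dict:
    """Toy induction loop (illustrative, not the platform's API).

    `propose(doc) -> {field_name: field_type}` stands in for the
    SchemaDesigner agent's LLM call; refinement is modeled as merging
    proposals across example documents, first proposal winning.
    """
    schema = {"fields": {}}
    for doc in examples:
        for name, ftype in propose(doc).items():
            schema["fields"].setdefault(name, ftype)
    return schema


def stub_propose(doc: str) -> dict:
    """Stub agent: propose one field per "key: value" line, guessing a
    crude type from the value."""
    proposal = {}
    for line in doc.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            proposal[name.strip()] = "number" if value.strip().isdigit() else "string"
    return proposal
```

The point of the shape: schemas are induced from data, so pointing the platform at a new corpus means swapping the example documents, not writing field mappings by hand.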
Architecture
- FastAPI (Cloud Run) handles job orchestration, signed GCS upload URLs, schema induction endpoints, and health checks.
- Cloud Function Gen2 triggers on GCS finalization events (Eventarc) — one invocation per uploaded PDF.
- Firestore tracks per-file status atomically, preventing duplicate processing under concurrent retries.
- Shared core package (multiagent-core): domain models, LLM adapters, chunking logic, storage services — installed in both backend and worker.
- React frontend (Vite): schema inducer UI, document uploader, job status monitor, and results viewer — communicates exclusively via the generated JS SDK.
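The atomic per-file status tracking above can be sketched with an in-memory stand-in for Firestore. In the real system the claim is a Firestore transaction that reads the per-file document and writes its status atomically; here a `threading.Lock` simulates that transaction boundary, and names like `claim_file` are illustrative:

```python
import threading


class InMemoryFileLocks:
    """Stand-in for Firestore: one status record per (job_id, file_id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def claim_file(self, job_id: str, file_id: str) -> bool:
        """Return True if this worker won the claim; False if the file is
        already being processed or done (retry / duplicate trigger)."""
        key = (job_id, file_id)
        with self._lock:  # transaction boundary in the real system
            if self._status.get(key) in ("processing", "done"):
                return False
            self._status[key] = "processing"
            return True

    def mark_done(self, job_id: str, file_id: str) -> None:
        with self._lock:
            self._status[(job_id, file_id)] = "done"


def handle_event(locks: InMemoryFileLocks, job_id: str, file_id: str) -> str:
    """Worker entry point: process the file only if the claim succeeds."""
    if not locks.claim_file(job_id, file_id):
        return "skipped"  # duplicate GCS event or retry: safe no-op
    # ... chunk the PDF and run LLM extraction here ...
    locks.mark_done(job_id, file_id)
    return "processed"
```

A second delivery of the same event returns "skipped", which is what makes Eventarc's at-least-once delivery and blind retries safe.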
Outcome
- Applied to 5200+ judicial appeal documents (TARCJA project): 95–100% core field coverage, 16.8s avg latency per document.
- Schema induction reduces domain setup from hours of manual field mapping to minutes of agent-guided refinement.
- Idempotent processing: any failed or partial job can be retried or replayed without risk of duplicate records.
- Architecture scales horizontally: each PDF is an independent Cloud Function invocation — throughput grows with GCS upload rate.