[Screenshot: TARCJA document extraction dashboard]


TARCJA: judicial document extraction & dashboard

End-to-end application of a general-purpose LLM taxonomist to a corpus of judicial appeal documents. Extracted structured records at scale, then built a static Plotly dashboard for exploration and multi-run comparison.

Stack

Python · Plotly · Jinja2 · GCP (Cloud Run, Cloud Functions, Firestore) · Gemini · OpenAI

Artifacts

Public demo / repo

TL;DR

  • Applied the multiagent document taxonomist to 5200+ judicial appeal PDFs — no per-document parsers.
  • Designed a domain schema for judicial appeals: facts, actors, jurisdiction, lifecycle, risk, and outcome fields.
  • Processed at scale: 95–100% core field coverage, 16.8s avg latency per document, p95 26.1s.
  • Built a static Plotly + Jinja2 dashboard deployed on GitHub Pages with multi-run comparison view.

Reusable patterns

  • Domain schema design for unstructured legal text: universal facets (actors, jurisdiction, lifecycle, risk) transfer across document types.
  • Static dashboard from JSONL: Plotly + Jinja2 generates standalone HTML — no server needed post-extraction.
  • Multi-run comparison UI: version selector and compare view expose schema or data changes across extraction runs.
  • Local-first tools for iteration: run schema induction and extraction locally before deploying to cloud workers.
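The "static dashboard from JSONL" pattern can be sketched with stdlib-only code. The real pipeline uses Jinja2 templates and Plotly's Python API; here, for brevity, a `string.Template` stands in for Jinja2 and Plotly.js is loaded from its CDN. The field name `outcome` and the sample records are illustrative:

```python
import json
from string import Template

# Stand-in for the project's Jinja2 template: a self-contained HTML page
# that loads Plotly.js from the CDN and renders one bar chart.
PAGE = Template("""<!DOCTYPE html>
<html><head><script src="https://cdn.plot.ly/plotly-2.35.2.min.js"></script></head>
<body><div id="chart"></div>
<script>Plotly.newPlot("chart", [{type: "bar", x: $labels, y: $counts}],
                       {title: "Outcomes per category"});</script>
</body></html>""")

def build_dashboard(jsonl_lines, field="outcome"):
    """Aggregate one field across extracted records and emit standalone HTML."""
    counts = {}
    for line in jsonl_lines:
        value = json.loads(line).get(field, "unknown")
        counts[value] = counts.get(value, 0) + 1
    return PAGE.substitute(labels=json.dumps(list(counts)),
                           counts=json.dumps(list(counts.values())))

records = ['{"outcome": "affirmed"}', '{"outcome": "reversed"}',
           '{"outcome": "affirmed"}']
html = build_dashboard(records)  # write to dist/index.html for GitHub Pages
```

Because the data is inlined into the page at build time, the output is a single HTML file with no server or database behind it.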

Context

Started as a submission to the Gemini API Developer Competition on Devpost.

The domain: a corpus of judicial appeal documents in unstructured PDF format, with no consistent layout or field structure.

Goal: extract structured, queryable records from each document automatically, then explore patterns across the full corpus.

Decisions

  • Reused the multiagent document taxonomist as the extraction backend — no need to rebuild schema induction or cloud workers for this domain.
  • Designed a judicial-specific schema using the SchemaDesigner agent on a sample of documents: key fields include parties, jurisdiction, proceeding type, lifecycle stage, and outcome.
  • Chose static HTML output (Plotly + Jinja2) for the dashboard: no database or server needed to share results, and GitHub Pages is free.
  • Multi-run structure in the dashboard: each extraction run produces its own view, with a comparison page to surface changes across schema versions or data updates.
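The induced schema itself is not reproduced here, but an illustrative shape for the core fields named above might look like the following (field names, types, and the `validate` helper are assumptions, not the actual SchemaDesigner output):

```python
# Illustrative core-field schema for a judicial appeal record; the real
# SchemaDesigner output and its field names may differ.
JUDICIAL_APPEAL_SCHEMA = {
    "parties": {"type": "array", "items": {"type": "string"},
                "description": "Named appellants and respondents"},
    "jurisdiction": {"type": "string",
                     "description": "Court or region hearing the appeal"},
    "proceeding_type": {"type": "string",
                        "description": "e.g. civil, criminal, administrative"},
    "lifecycle_stage": {"type": "string",
                        "description": "Where the case sits in the appeal process"},
    "outcome": {"type": "string",
                "description": "e.g. affirmed, reversed, remanded"},
}

def validate(record, schema=JUDICIAL_APPEAL_SCHEMA):
    """Minimal check: list the schema fields missing from an extracted record."""
    return [field for field in schema if field not in record]

missing = validate({"parties": ["A", "B"], "jurisdiction": "X",
                    "proceeding_type": "civil", "lifecycle_stage": "filed"})
# 'outcome' is the only schema field absent from this sample record
```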

Architecture

  • Extraction backend: multiagent document taxonomist platform (FastAPI + Cloud Functions + Firestore).
  • Schema: induced via SchemaDesigner agent on a sample of judicial PDFs, then refined manually for domain accuracy.
  • Output: JSONL per run, stored locally and fed into dashboard_tarc for visualization.
  • dashboard_tarc CLI: builds standalone HTML dashboards from JSONL + schema manifest — one command, no server.
  • GitHub Pages deployment: static dist/ output committed to the repository's docs/ folder and served directly by GitHub Pages; no CI/CD needed.
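At its core, the multi-run comparison view reduces to diffing per-field coverage between runs' JSONL outputs. A minimal sketch, with made-up run data and field names:

```python
import json

def field_coverage(jsonl_lines, fields):
    """Fraction of records with a non-empty value for each field."""
    records = [json.loads(line) for line in jsonl_lines]
    return {f: sum(1 for r in records if r.get(f)) / len(records)
            for f in fields}

def compare_runs(run_a, run_b, fields):
    """Per-field coverage delta between two extraction runs (b minus a)."""
    a, b = field_coverage(run_a, fields), field_coverage(run_b, fields)
    return {f: round(b[f] - a[f], 3) for f in fields}

# Hypothetical v1/v2 runs: v2's schema or prompts recover more outcomes.
run_v1 = ['{"outcome": "affirmed"}', '{"outcome": ""}']
run_v2 = ['{"outcome": "affirmed"}', '{"outcome": "reversed"}']
delta = compare_runs(run_v1, run_v2, ["outcome"])  # coverage rises 0.5 -> 1.0
```

The dashboard's compare page renders tables of exactly this kind of delta, so schema changes between runs surface as coverage shifts rather than silent data drift.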

Outcome

  • 5200+ judicial documents processed end-to-end.
  • 95–100% field coverage on core schema fields (parties, jurisdiction, proceeding type, outcome).
  • Avg LLM extraction latency: 16.8s per document; p95: 26.1s.
  • Static dashboard deployed on GitHub Pages with multi-run comparison and run selector.
  • Continued development after the hackathon: schema versioning, additional facets, observability improvements.
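The latency figures above are simple order statistics over per-document timings. For illustration, a nearest-rank p95 over hypothetical timings (the real corpus has 5200+ documents):

```python
import math

def p95(latencies_s):
    """Nearest-rank 95th percentile of per-document extraction latencies."""
    ranked = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ranked))  # 1-based nearest-rank index
    return ranked[rank - 1]

# Hypothetical per-document timings in seconds.
timings = [float(t) for t in range(1, 21)]  # 1.0 .. 20.0
tail = p95(timings)
avg = sum(timings) / len(timings)
```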
