
TARCJA: judicial document extraction & dashboard
End-to-end application of a general-purpose LLM taxonomist to a corpus of judicial appeal documents. Extracted structured records at scale, then built a static Plotly dashboard for exploration and multi-run comparison.
Stack
Python · Plotly · Jinja2 · GCP (Cloud Run, Cloud Functions, Firestore) · Gemini · OpenAI
Artifacts
Public demo / repo
TL;DR
- Applied the multiagent document taxonomist to 5200+ judicial appeal PDFs — no per-document parsers.
- Designed a domain schema for judicial appeals: facts, actors, jurisdiction, lifecycle, risk, and outcome fields.
- Processed at scale: 95–100% core field coverage, 16.8s avg latency per document, p95 26.1s.
- Built a static Plotly + Jinja2 dashboard deployed on GitHub Pages with multi-run comparison view.
Reusable patterns
- Domain schema design for unstructured legal text: universal facets (actors, jurisdiction, lifecycle, risk) transfer across document types.
- Static dashboard from JSONL: Plotly + Jinja2 generates standalone HTML — no server needed post-extraction.
- Multi-run comparison UI: version selector and compare view expose schema or data changes across extraction runs.
- Local-first tools for iteration: run schema induction and extraction locally before deploying to cloud workers.
Context
Started as a submission to the Gemini API Developer Competition on Devpost.
The domain: a corpus of judicial appeal documents in unstructured PDF format, with no consistent layout or field structure.
Goal: extract structured, queryable records from each document automatically, then explore patterns across the full corpus.
Decisions
- Reused the multiagent document taxonomist as the extraction backend — no need to rebuild schema induction or cloud workers for this domain.
- Designed a judicial-specific schema using the SchemaDesigner agent on a sample of documents: key fields include parties, jurisdiction, proceeding type, lifecycle stage, and outcome.
- Chose static HTML output (Plotly + Jinja2) for the dashboard: no database or server needed to share results, and GitHub Pages is free.
- Multi-run structure in the dashboard: each extraction run produces its own view, with a comparison page to surface changes across schema versions or data updates.
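The comparison view boils down to diffing summary statistics between runs. A sketch of one such statistic, per-field coverage, computed over two runs' records (the field list here is illustrative; the real comparison page covers the full schema):

```python
# Sketch: per-field coverage diff between two extraction runs -- the kind of
# summary a multi-run comparison page renders. Field names are illustrative.

def field_coverage(records, fields):
    """Fraction of records with a non-empty value for each field."""
    n = len(records) or 1
    return {f: sum(1 for r in records if r.get(f) not in (None, "")) / n
            for f in fields}

def coverage_diff(run_a, run_b, fields):
    """Positive values mean run_b covers the field better than run_a."""
    a, b = field_coverage(run_a, fields), field_coverage(run_b, fields)
    return {f: round(b[f] - a[f], 3) for f in fields}
```

A schema change that drops or renames a field shows up immediately as a large negative entry, which is the point of keeping every run's output side by side.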
Architecture
- Extraction backend: multiagent document taxonomist platform (FastAPI + Cloud Functions + Firestore).
- Schema: induced via SchemaDesigner agent on a sample of judicial PDFs, then refined manually for domain accuracy.
- Output: JSONL per run, stored locally and fed into dashboard_tarc for visualization.
- dashboard_tarc CLI: builds standalone HTML dashboards from JSONL + schema manifest — one command, no server.
- GitHub Pages deployment: static dist/ output pushed to the repository's docs/ folder, no CI/CD needed.
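To make the pipeline concrete, here is what one extracted record might look like as a JSONL line. Field names follow the schema facets listed above (parties, jurisdiction, proceeding type, lifecycle, outcome); the identifier and values are invented for the example.

```python
# Illustrative shape of one extracted record; one JSON object per line in the
# run's JSONL output. All values below are invented for the example.
import json

record = {
    "case_id": "APP-2024-0137",  # hypothetical identifier scheme
    "parties": {"appellant": "Doe", "respondent": "State"},
    "jurisdiction": "court_of_appeals",
    "proceeding_type": "criminal_appeal",
    "lifecycle_stage": "decided",
    "outcome": "affirmed",
}
line = json.dumps(record, ensure_ascii=False)
```

Keeping each run as plain JSONL is what lets the dashboard build step stay a single local command with no database in between.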
Outcome
- 5200+ judicial documents processed end-to-end.
- 95–100% field coverage on core schema fields (parties, jurisdiction, proceeding type, outcome).
- Avg LLM extraction latency: 16.8s per document; p95: 26.1s.
- Static dashboard deployed on GitHub Pages with multi-run comparison and run selector.
- Continued development after the hackathon: schema versioning, additional facets, observability improvements.
Links
Related