[Screenshot: TARCJA document extraction dashboard]


TARCJA: judicial document extraction & dashboard

End-to-end application of a general-purpose LLM taxonomist to a corpus of judicial appeal documents. Extracted structured records at scale, then built a static Plotly dashboard for exploration and multi-run comparison.

Stack

Python · Plotly · Jinja2 · GCP (Cloud Run, Cloud Functions, Firestore) · Gemini · OpenAI

Artifacts

Public demo / repo

TL;DR

  • Applied the multiagent document taxonomist to 5200+ judicial appeal PDFs — no per-document parsers.
  • Designed a domain schema for judicial appeals: facts, actors, jurisdiction, lifecycle, risk, and outcome fields.
  • Processed at scale: 95–100% core field coverage, 16.8s avg latency per document, p95 26.1s.
  • Built a static Plotly + Jinja2 dashboard deployed on GitHub Pages with multi-run comparison view.

Reusable patterns

  • Domain schema design for unstructured legal text: universal facets (actors, jurisdiction, lifecycle, risk) transfer across document types.
  • Static dashboard from JSONL: Plotly + Jinja2 generates standalone HTML — no server needed post-extraction.
  • Multi-run comparison UI: version selector and compare view expose schema or data changes across extraction runs.
  • Local-first tools for iteration: run schema induction and extraction locally before deploying to cloud workers.
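The "static dashboard from JSONL" pattern can be sketched with stdlib-only code. The real pipeline uses Jinja2 templates and Plotly's Python API; here, for brevity, a `string.Template` stands in for Jinja2 and Plotly.js is loaded from its CDN. The field name `outcome` and the sample records are illustrative:

```python
import json
from string import Template

# Stand-in for the project's Jinja2 template: a self-contained HTML page
# that loads Plotly.js from the CDN and renders one bar chart.
PAGE = Template("""<!DOCTYPE html>
<html><head><script src="https://cdn.plot.ly/plotly-2.35.2.min.js"></script></head>
<body><div id="chart"></div>
<script>Plotly.newPlot("chart", [{type: "bar", x: $labels, y: $counts}],
                       {title: "Outcomes per category"});</script>
</body></html>""")

def build_dashboard(jsonl_lines, field="outcome"):
    """Aggregate one field across extracted records and emit standalone HTML."""
    counts = {}
    for line in jsonl_lines:
        value = json.loads(line).get(field, "unknown")
        counts[value] = counts.get(value, 0) + 1
    return PAGE.substitute(labels=json.dumps(list(counts)),
                           counts=json.dumps(list(counts.values())))

records = ['{"outcome": "affirmed"}', '{"outcome": "reversed"}',
           '{"outcome": "affirmed"}']
html = build_dashboard(records)  # write to dist/index.html for GitHub Pages
```

Because the data is inlined into the page at build time, the output is a single HTML file with no server or database behind it.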

Context

Started as a submission to the Gemini API Developer Competition on Devpost.

The domain: a corpus of judicial appeal documents in unstructured PDF format, with no consistent layout or field structure.

Goal: extract structured, queryable records from each document automatically, then explore patterns across the full corpus.

Decisions

  • Reused the multiagent document taxonomist as the extraction backend — no need to rebuild schema induction or cloud workers for this domain.
  • Designed a judicial-specific schema using the SchemaDesigner agent on a sample of documents: key fields include parties, jurisdiction, proceeding type, lifecycle stage, and outcome.
  • Chose static HTML output (Plotly + Jinja2) for the dashboard: no database or server needed to share results, and GitHub Pages is free.
  • Multi-run structure in the dashboard: each extraction run produces its own view, with a comparison page to surface changes across schema versions or data updates.
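The induced schema itself is not reproduced here, but an illustrative shape for the core fields named above might look like the following (field names, types, and the `validate` helper are assumptions, not the actual SchemaDesigner output):

```python
# Illustrative core-field schema for a judicial appeal record; the real
# SchemaDesigner output and its field names may differ.
JUDICIAL_APPEAL_SCHEMA = {
    "parties": {"type": "array", "items": {"type": "string"},
                "description": "Named appellants and respondents"},
    "jurisdiction": {"type": "string",
                     "description": "Court or region hearing the appeal"},
    "proceeding_type": {"type": "string",
                        "description": "e.g. civil, criminal, administrative"},
    "lifecycle_stage": {"type": "string",
                        "description": "Where the case sits in the appeal process"},
    "outcome": {"type": "string",
                "description": "e.g. affirmed, reversed, remanded"},
}

def validate(record, schema=JUDICIAL_APPEAL_SCHEMA):
    """Minimal check: list the schema fields missing from an extracted record."""
    return [field for field in schema if field not in record]

missing = validate({"parties": ["A", "B"], "jurisdiction": "X",
                    "proceeding_type": "civil", "lifecycle_stage": "filed"})
# 'outcome' is the only schema field absent from this sample record
```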

Architecture

  • Extraction backend: multiagent document taxonomist platform (FastAPI + Cloud Functions + Firestore).
  • Schema: induced via SchemaDesigner agent on a sample of judicial PDFs, then refined manually for domain accuracy.
  • Output: JSONL per run, stored locally and fed into dashboard_tarc for visualization.
  • dashboard_tarc CLI: builds standalone HTML dashboards from JSONL + schema manifest — one command, no server.
  • GitHub Pages deployment: static dist/ output committed to the repository's docs/ folder and served directly by GitHub Pages; no CI/CD needed.
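At its core, the multi-run comparison view reduces to diffing per-field coverage between runs' JSONL outputs. A minimal sketch, with made-up run data and field names:

```python
import json

def field_coverage(jsonl_lines, fields):
    """Fraction of records with a non-empty value for each field."""
    records = [json.loads(line) for line in jsonl_lines]
    return {f: sum(1 for r in records if r.get(f)) / len(records)
            for f in fields}

def compare_runs(run_a, run_b, fields):
    """Per-field coverage delta between two extraction runs (b minus a)."""
    a, b = field_coverage(run_a, fields), field_coverage(run_b, fields)
    return {f: round(b[f] - a[f], 3) for f in fields}

# Hypothetical v1/v2 runs: v2's schema or prompts recover more outcomes.
run_v1 = ['{"outcome": "affirmed"}', '{"outcome": ""}']
run_v2 = ['{"outcome": "affirmed"}', '{"outcome": "reversed"}']
delta = compare_runs(run_v1, run_v2, ["outcome"])  # coverage rises 0.5 -> 1.0
```

The dashboard's compare page renders tables of exactly this kind of delta, so schema changes between runs surface as coverage shifts rather than silent data drift.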

Outcome

  • 5200+ judicial documents processed end-to-end.
  • 95–100% field coverage on core schema fields (parties, jurisdiction, proceeding type, outcome).
  • Avg LLM extraction latency: 16.8s per document; p95: 26.1s.
  • Static dashboard deployed on GitHub Pages with multi-run comparison and run selector.
  • Continued development after the hackathon: schema versioning, additional facets, observability improvements.
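The latency figures above are simple order statistics over per-document timings. For illustration, a nearest-rank p95 over hypothetical timings (the real corpus has 5200+ documents):

```python
import math

def p95(latencies_s):
    """Nearest-rank 95th percentile of per-document extraction latencies."""
    ranked = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ranked))  # 1-based nearest-rank index
    return ranked[rank - 1]

# Hypothetical per-document timings in seconds.
timings = [float(t) for t in range(1, 21)]  # 1.0 .. 20.0
tail = p95(timings)
avg = sum(timings) / len(timings)
```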
