From internal masterclasses to operating judgment for regulatory RAG

Problem

In applied AI for fiscal and regulatory domains, knowing how to use a framework is not enough. The real problem is deciding what to measure, what to retrieve, what to put in production, and how to detect that the system got worse. That judgment gets lost when study notes remain a loose list of articles, demos, and prompts.

Decision

I organized the study work as four cumulative masterclasses: AI system evaluation, information retrieval, production patterns, and inference economics. One throughline runs across all four: RAG systems over Chilean regulatory and fiscal corpora, each covered with theory, executable code, and examples.

In the local repo, the first three parts are complete: 12 evaluation sections, 9 retrieval sections, and 12 production sections. Inference economics is the next dedicated module, although the production layer already includes cost metering, budget guards, and cost-aware routing.

Tradeoffs

Implement primitives from scratch when they teach judgment: BM25, TF-IDF, RRF, token buckets, circuit breakers, and caches.
Prefer Chilean corpora and examples over generic demos that hide domain-specific failures.
Separate evaluation, retrieval, and production so each failure has its own diagnosis.
Avoid oversized infrastructure: start with FastAPI, Postgres/pgvector, enough observability, and runbooks.
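
To show how small these primitives can be, here is a minimal sketch of reciprocal rank fusion (RRF), one of the primitives listed above. It is not the code from the repo. The document IDs are made up for illustration, and k=60 is an assumption (the constant commonly used with RRF); each document scores the sum of 1 / (k + rank) over the rankings in which it appears.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed value, not taken from the repo).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a lexical and a dense retriever.
bm25_ranking = ["ley_c", "circular_a", "oficio_b"]
embedding_ranking = ["circular_a", "resolucion_d", "ley_c"]

fused = rrf([bm25_ranking, embedding_ranking])
print(fused)
```

Writing it by hand makes one property visible: RRF uses only rank positions, never raw scores, so BM25 and embedding similarities do not need to be normalized against each other before fusing.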

Validation

Each section pairs a theory document with an executable script. The evaluation module covers golden datasets, retrieval and generation metrics, LLM-as-judge, bootstrap resampling, and CI integration. The retrieval module compares BM25, embeddings, hybrid search, chunking, query rewriting, reranking, and domain edge cases. The production module adds versioned prompts, multi-level caching, tracing, retries, circuit breakers, canary releases, online evals, cost controls, security, and incident handling.
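
As an example of what one of those scripts might look like, here is a minimal sketch of a percentile bootstrap over per-query scores from a golden dataset. The scores, the number of resamples, and the function name are assumptions for illustration; the point is that a metric measured on a few dozen queries should be reported with an interval, not as a single number.

```python
import random
from statistics import mean


def bootstrap_ci(per_query_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible in CI
    n = len(per_query_scores)
    # Resample the queries with replacement and recompute the mean each time.
    resampled_means = sorted(
        mean(rng.choices(per_query_scores, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(per_query_scores), lower, upper


# Hypothetical hit@k results (1 = relevant document retrieved) for 20 golden queries.
hits = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]

point, lower, upper = bootstrap_ci(hits)
print(f"hit rate = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With only 20 queries the interval is wide, which is the useful lesson: a small change in the point estimate between two retrieval configurations may not be a real difference.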

Outcome

A reusable map for designing fiscal/regulatory RAG systems with explicit criteria.
Small code that exposes the mechanisms before hiding them behind libraries.
Base material for posts, internal workshops, or client conversations about high-stakes AI.
A clear progression: measure first, retrieve better, then operate with visible costs and failures.

The publishable path is not dumping the full masterclass into the blog. It is distilling concrete pieces: how to build a fiscal golden dataset, when BM25 beats embeddings, or how to calculate the real cost of a RAG system before selling it.
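
A minimal sketch of that last piece, the cost calculation, could look like the following. Every number here is a placeholder, not a real vendor price or a measurement from the repo, and the model assumes, as a simplification, that a cache hit costs nothing. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    prompt_tokens: int       # system prompt + retrieved chunks + user question
    completion_tokens: int   # generated answer
    embedding_tokens: int    # tokens embedded to run the query
    rerank_calls: int        # reranker invocations per query


@dataclass
class Prices:
    """Unit prices in USD. Placeholder values only."""
    per_1k_prompt: float
    per_1k_completion: float
    per_1k_embedding: float
    per_rerank_call: float


def cost_per_query(profile, prices, cache_hit_rate=0.0):
    """Expected cost of one query, assuming cache hits are free."""
    generation = (
        profile.prompt_tokens / 1000 * prices.per_1k_prompt
        + profile.completion_tokens / 1000 * prices.per_1k_completion
    )
    retrieval = (
        profile.embedding_tokens / 1000 * prices.per_1k_embedding
        + profile.rerank_calls * prices.per_rerank_call
    )
    return (1.0 - cache_hit_rate) * (generation + retrieval)


profile = QueryProfile(prompt_tokens=4000, completion_tokens=500,
                       embedding_tokens=50, rerank_calls=1)
prices = Prices(per_1k_prompt=0.003, per_1k_completion=0.015,
                per_1k_embedding=0.0001, per_rerank_call=0.002)

unit_cost = cost_per_query(profile, prices)
monthly_cost = 100_000 * cost_per_query(profile, prices, cache_hit_rate=0.2)
print(f"per query: ${unit_cost:.6f}, per month: ${monthly_cost:.2f}")
```

Even this toy model shows where the money goes: with these placeholder numbers the retrieved context in the prompt dominates the bill, so chunking and top-k choices are also pricing decisions.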