Stu Mason
production · private system
Retrieval engineering

Most RAG makes things up.
This one cites its sources.

A naive vector search will answer anything, confidently, whether the answer is in your documents or not. Here is the retrieval system I build instead.

query

What's our refund window for enterprise plans?

Enterprise plans carry a 30-day refund window from the invoice date [1], extended to 60 days where a signed MSA specifies it [2]. Refunds reach the original payment method within 10 working days [3].

[1] billing-policy.md § 4.2
[2] enterprise-msa.pdf clause 9
[3] finance-runbook.md "Refunds"

Every claim maps to a chunk you can open and check.

Watch me walk through it

A video walkthrough of where naive RAG quietly fails, and how hybrid retrieval and the refusal guard fix it.

The hard part

RAG fails in two quiet ways.

Both are invisible in a demo and expensive in production. The whole architecture exists to close them.

It misses the right chunk

A single vector search has one way of being wrong: semantic drift. Ask about a "refund window" and it can sail past the paragraph that says "money-back period" in different words, or rank a near-duplicate above the real source. Recall quietly drops and nobody notices.

It bluffs when it has nothing

When retrieval comes back empty or weak, a plain LLM still answers. It generates something fluent and plausible that is simply not in your data. In a customer-facing or compliance setting, that is the failure that gets someone fired.

Recall: three retrievers, fused

One search angle is never enough.

Vector, full-text and trigram each catch what the others miss. Reciprocal rank fusion combines them without a single hand-tuned weight to drift out of date.

Vector · pgvector cosine similarity

Catches meaning. "refund window" matches "money-back period".

Full-text · Postgres tsvector

Catches exact terms a vector can blur past. "refund", "enterprise".

Trigram · pg_trgm fuzzy match

Catches typos and partials. "refnd", "ent. plan".
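
To make the three angles concrete, here is a minimal sketch of what each retriever's query might look like. The `chunks` table, its columns (`project_id`, `content`, `embedding`, `tsv`) and the `'english'` text-search configuration are assumptions for illustration, not the production schema. The operators themselves (`<=>` from pgvector, `@@` with `ts_rank` from Postgres full-text search, `<%` with `word_similarity` from pg_trgm) are the standard ones.

```python
# Hypothetical schema: chunks(id, project_id, content, embedding, tsv).
# Placeholders are Postgres-style positional parameters:
#   $1 = project id, $2 = query (embedding for "vector", text otherwise), $3 = limit.
RETRIEVER_SQL = {
    # pgvector: <=> is cosine distance, so ascending order puts the closest first.
    "vector": """
        SELECT id FROM chunks
        WHERE project_id = $1
        ORDER BY embedding <=> $2
        LIMIT $3
    """,
    # Full-text: match the stored tsvector, then rank the matches with ts_rank.
    "fulltext": """
        SELECT id FROM chunks
        WHERE project_id = $1
          AND tsv @@ websearch_to_tsquery('english', $2)
        ORDER BY ts_rank(tsv, websearch_to_tsquery('english', $2)) DESC
        LIMIT $3
    """,
    # pg_trgm: word_similarity scores the query against the best-matching
    # stretch of the chunk, which suits short queries against long text.
    "trigram": """
        SELECT id FROM chunks
        WHERE project_id = $1
          AND $2 <% content
        ORDER BY word_similarity($2, content) DESC
        LIMIT $3
    """,
}
```

Each query returns only ranked ids, which is all rank fusion needs. Scoping every query by `project_id` is one assumed way to keep tenants apart at the query level.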

RRF
fused context · top-k
1. billing-policy § 4.2
2. enterprise-msa clause 9
3. finance-runbook "Refunds"

Chunks that rank well in more than one retriever rise to the top.

// RRF score for a chunk d
score(d) = Σ_r 1 / (k + rank_r(d))   summed over each retriever r
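
The formula translates almost line for line into code. This is a sketch rather than the production implementation: the function name `rrf_fuse`, the default `k=60` (a commonly used constant; the value used here is not stated) and the sample rankings are all assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=3):
    """Fuse ranked lists of chunk ids with reciprocal rank fusion.

    rankings maps a retriever name to its chunk ids, best first.
    Returns the top_k (chunk_id, score) pairs, highest score first.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            # A chunk missing from a retriever's list adds nothing for it.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score, breaking ties on id so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Illustrative rankings only, not real retrieval output.
rankings = {
    "vector": ["billing-policy", "enterprise-msa", "finance-runbook"],
    "fulltext": ["billing-policy", "enterprise-msa", "pricing-faq"],
    "trigram": ["billing-policy", "finance-runbook"],
}
top_ids = [chunk_id for chunk_id, _ in rrf_fuse(rankings)]
# top_ids == ["billing-policy", "enterprise-msa", "finance-runbook"]
```

Only ranks enter the score, never the raw similarity values, so the three retrievers do not need comparable score scales.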
Honesty: the refusal guard

"I don't know" is a feature.

Below a retrieval-confidence threshold, the system refuses and says why, instead of generating a confident answer from nothing.

query · how many seats does Acme have on Platinum?

I can't find that in your documents. The indexed sources cover billing policy and the MSA, but none record per-customer seat counts. Point me at the CRM export and I'll answer it.

refused · confidence 0.21 below 0.55 threshold
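
The guard itself can be very small. A minimal sketch, assuming confidence is the best per-chunk relevance score on a 0 to 1 scale; how the real system derives that number is not described here, so `refusal_guard` and `Decision` are hypothetical names. The 0.55 threshold is the one shown in the example above.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    answerable: bool
    confidence: float
    reason: str


def refusal_guard(chunk_confidences, threshold=0.55):
    """Decide whether retrieval is strong enough to answer from.

    chunk_confidences: per-chunk relevance scores in [0, 1] (assumed scale).
    An empty retrieval counts as zero confidence.
    """
    confidence = max(chunk_confidences, default=0.0)
    if confidence < threshold:
        reason = (
            f"refused · confidence {confidence:.2f} "
            f"below {threshold:.2f} threshold"
        )
        return Decision(False, confidence, reason)
    return Decision(True, confidence, "grounded")


decision = refusal_guard([0.21, 0.10])
# decision.reason == "refused · confidence 0.21 below 0.55 threshold"
```

The check runs before any generation, so a weak retrieval never reaches the model as context.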
The retrieval path

One question, five moves.

  1. Transform the question

    The raw question is rewritten for retrieval: acronyms expanded, implied context added, sometimes split into sub-queries. A vague question stops returning vague chunks.

  2. Three retrievers, in parallel

    Vector for meaning, full-text for exact terms, trigram for fuzzy matches. Each is blind to what the others find, so each covers a different failure mode.

  3. Fuse with reciprocal rank fusion

    The three ranked lists merge by RRF: a chunk that ranks well across more than one retriever rises to the top. No hand-tuned weights to drift out of date.

  4. Optional rerank

    For high-stakes queries a cross-encoder reranks the fused shortlist, trading a little latency for sharper top-k precision.

  5. Ground, cite, or refuse

    The answer is built only from retrieved chunks, each claim carrying a citation. If retrieval confidence is below threshold, the system refuses instead of inventing.
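
The five moves above can be sketched as one orchestration function. Every stage is passed in as a callable, because the real stages (query rewriting, the Postgres retrievers, the cross-encoder, the grounded generation) are not shown here. The function name, its parameters and the shape of the result are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def answer(question, transform, retrievers, fuse, ground, confidence_of,
           rerank=None, threshold=0.55):
    """Run one question through the five-step retrieval path."""
    # 1. Transform the question into a retrieval-friendly query.
    query = transform(question)

    # 2. Run every retriever in parallel; none sees the others' results.
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as pool:
        futures = {name: pool.submit(fn, query)
                   for name, fn in retrievers.items()}
        rankings = {name: future.result() for name, future in futures.items()}

    # 3. Fuse the ranked lists into one shortlist.
    shortlist = fuse(rankings)

    # 4. Optionally rerank the shortlist for sharper top-k precision.
    if rerank is not None:
        shortlist = rerank(query, shortlist)

    # 5. Ground and cite, or refuse when retrieval is too weak.
    confidence = confidence_of(query, shortlist) if shortlist else 0.0
    if confidence < threshold:
        return {"refused": True, "confidence": confidence, "citations": []}
    return {"refused": False, "confidence": confidence,
            **ground(query, shortlist)}
```

Keeping the stages as plain callables means the optional rerank step is a single argument, and each stage can be tested on its own with stubs.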

Built to be trusted

Per-project isolation

Each client corpus is its own namespace. No cross-contamination, no one tenant seeing another tenant's data.

Token quotas, per project

Every project carries a token budget. Cost is governed at the source, not discovered on the invoice.
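
One way to picture a per-project budget is a check that runs before any tokens are spent. This is an in-memory sketch with hypothetical names (`ProjectQuotas`, `QuotaExceeded`); a real deployment would presumably persist usage, and the limits shown are made up.

```python
class QuotaExceeded(Exception):
    """Raised when a request would push a project past its token budget."""


class ProjectQuotas:
    def __init__(self, limits):
        # limits maps a project name to its token budget.
        self._limits = dict(limits)
        self._used = {project: 0 for project in self._limits}

    def remaining(self, project):
        return self._limits[project] - self._used[project]

    def charge(self, project, tokens):
        """Record token usage, refusing the request if it would overspend."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if tokens > self.remaining(project):
            raise QuotaExceeded(
                f"{project}: {tokens} tokens requested, "
                f"{self.remaining(project)} remaining"
            )
        self._used[project] += tokens


quotas = ProjectQuotas({"acme": 1000})  # illustrative budget
quotas.charge("acme", 400)              # 600 tokens left afterwards
```

A rejected charge leaves the recorded usage untouched, so one oversized request cannot eat into the budget.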

Citations on every claim

Answers map back to the exact chunk and source. A reader can verify, not just trust.
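
A simple way to guarantee that property is to make a claim impossible to construct without its citation. The sketch below assumes two small record types, `Claim` and `Citation`, which are illustrative names rather than the system's actual data model; the sample values come from the refund example at the top of this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source: str   # e.g. a file name
    locator: str  # e.g. a section, clause or heading


@dataclass(frozen=True)
class Claim:
    text: str
    citation: Citation  # required field: no claim without a source


def render(claims):
    """Render claims as numbered prose followed by their sources."""
    body = " ".join(
        f"{claim.text} [{i}]" for i, claim in enumerate(claims, start=1)
    )
    notes = "\n".join(
        f"[{i}] {claim.citation.source} {claim.citation.locator}"
        for i, claim in enumerate(claims, start=1)
    )
    return f"{body}\n\n{notes}"


claims = [
    Claim("Enterprise plans carry a 30-day refund window from the invoice date.",
          Citation("billing-policy.md", "§ 4.2")),
    Claim("Refunds reach the original payment method within 10 working days.",
          Citation("finance-runbook.md", '"Refunds"')),
]
rendered = render(claims)
```

Because the citation is a required field, an uncited claim fails at construction time instead of slipping into an answer.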

Laravel + MCP

Exposed as an MCP server, so the same retrieval powers an agent, a chat box, or a tool call with no rewrite.

This system runs on private client data, so the repo stays private. The engineering rigour is the same as the open work you can inspect, like coolify-mcp. Happy to walk through the code on a call.

What this means for your bench

Your clients have private data and want to ask it questions, without it lying to them.

Grounded retrieval over a client's own documents, shipped as an MCP server or a chat surface, badged as yours. This is one of several production AI systems I have built. If you have a RAG ask you can't staff, that is where I slot in.