A naive vector search will answer anything, confidently, whether the answer is in your documents or not. Here is the retrieval system I build instead.
What's our refund window for enterprise plans?
Enterprise plans carry a 30-day refund window from the invoice date [1], extended to 60 days where a signed MSA specifies it [2]. Refunds reach the original payment method within 10 working days [3].
Every claim maps to a chunk you can open and check.
I walk through where naive RAG quietly fails, and how hybrid retrieval and the refusal guard fix it.
Naive RAG fails in two ways: silent recall loss and confident fabrication. Both are invisible in a demo and expensive in production. The whole architecture exists to close them.
A single vector search has one way of being wrong: semantic drift. Ask about a "refund window" and it can sail past the paragraph that says "money-back period" in different words, or rank a near-duplicate above the real source. Recall quietly drops and nobody notices.
When retrieval comes back empty or weak, a plain LLM still answers. It generates something fluent and plausible that is simply not in your data. In a customer-facing or compliance setting, that is the failure that gets someone fired.
Vector, full-text and trigram each catch what the others miss. Reciprocal rank fusion combines them without a single hand-tuned weight to drift out of date.
Vector: catches meaning. "refund window" matches "money-back period".
Full-text: catches exact terms a vector can blur past. "refund", "enterprise".
Trigram: catches typos and partials. "refnd", "ent. plan".
Reciprocal rank fusion: chunks that rank well across more than one retriever rise to the top.
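The fusion step can be sketched in a few lines. This is a minimal illustration of reciprocal rank fusion, assuming each retriever returns chunk ids ordered best first. The constant k = 60 is a commonly used default rather than a value confirmed for this system, and the chunk ids and hit lists are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of chunk ids; each list is ordered best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) for a chunk it returned.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the three retrievers.
vector_hits = ["c7", "c2", "c9"]
fulltext_hits = ["c2", "c4", "c7"]
trigram_hits = ["c2", "c5"]

fused = reciprocal_rank_fusion([vector_hits, fulltext_hits, trigram_hits])
print(fused)
```

Here c2 appears in all three lists and c7 in two, so both outrank chunks that only one retriever found. Only ranks are used, never raw scores, which is why no per-retriever weight needs tuning.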
Below a retrieval-confidence threshold, the system refuses and says why, instead of generating a confident answer from nothing.
I can't find that in your documents. The indexed sources cover billing policy and the MSA, but none record per-customer seat counts. Point me at the CRM export and I'll answer it.
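The refusal guard can be sketched as a gate that runs before generation. This is an illustration under stated assumptions, not the production code: the `ScoredChunk` type, the `guard` function, the 0.55 threshold and the file names are all hypothetical, and scores are assumed to be normalised to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    chunk_id: str
    source: str
    score: float  # assumed normalised to [0, 1], higher is better


REFUSAL_THRESHOLD = 0.55  # hypothetical value; tune against a labelled eval set


def guard(chunks, threshold=REFUSAL_THRESHOLD):
    """Return (True, supporting_chunks) or (False, refusal_message)."""
    supporting = [c for c in chunks if c.score >= threshold]
    if supporting:
        return True, supporting
    # Nothing clears the bar: refuse, and say which sources were searched.
    sources = sorted({c.source for c in chunks})
    searched = ", ".join(sources) if sources else "no indexed sources"
    return False, f"I can't find that in your documents. Searched: {searched}."


weak = [
    ScoredChunk("c1", "billing-policy.pdf", 0.31),
    ScoredChunk("c2", "msa.pdf", 0.22),
]
ok, message = guard(weak)
print(ok, message)
```

The point of the design is that the language model is never called when the gate returns False, so there is nothing for it to invent.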
The raw question is rewritten for retrieval: acronyms expanded, implied context added, sometimes split into sub-queries. A vague question stops returning vague chunks.
Vector for meaning, full-text for exact terms, trigram for fuzzy matches. Each is blind to what the others find, so each covers a different failure mode.
The three ranked lists merge by RRF: a chunk that ranks well across more than one retriever rises to the top. No hand-tuned weights to drift out of date.
For high-stakes queries a cross-encoder reranks the fused shortlist, trading a little latency for sharper top-k precision.
The answer is built only from retrieved chunks, each claim carrying a citation. If retrieval confidence is below threshold, the system refuses instead of inventing.
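One way to enforce the "each claim carrying a citation" rule from the last step is a check on the generated answer before it is shown. The sketch below assumes citations appear as bracketed numbers such as [1] that index into the retrieved chunks. The function name, the marker format and the sample answer are assumptions for illustration.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def uncited_sentences(answer, num_chunks):
    """Return sentences with no citation, or citing a chunk that was not retrieved."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    problems = []
    for sentence in sentences:
        cited = [int(n) for n in CITATION.findall(sentence)]
        out_of_range = any(n < 1 or n > num_chunks for n in cited)
        if not cited or out_of_range:
            problems.append(sentence)
    return problems


# Made-up answer: the last sentence carries no citation and should be flagged.
answer = (
    "Enterprise plans carry a 30-day refund window [1]. "
    "Refunds arrive within 10 working days [3]. "
    "Support is available around the clock."
)
print(uncited_sentences(answer, num_chunks=3))
```

A non-empty result can trigger a regeneration or a refusal. The sentence splitter here is deliberately simple and would need hardening for abbreviations and decimals.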
Each client corpus is its own namespace. No cross-contamination, and no tenant sees another tenant's data.
Every project carries a token budget. Cost is governed at the source, not discovered on the invoice.
Answers map back to the exact chunk and source. A reader can verify, not just trust.
Exposed as an MCP server, so the same retrieval powers an agent, a chat box, or a tool call with no rewrite.
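The per-project token budget mentioned above can be sketched as a counter that is charged before each model call. The `TokenBudget` class, the `BudgetExceeded` exception and the limit of 1000 tokens are hypothetical names and values, chosen only to show the idea of refusing spend up front.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push a project past its token limit."""


class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, tokens):
        """Reserve tokens before a model call; return what remains."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            # Reject before spending, so the overrun never reaches the invoice.
            raise BudgetExceeded(
                f"requested {tokens}, only {self.limit - self.used} left"
            )
        self.used += tokens
        return self.limit - self.used


# One budget per project namespace (names are illustrative).
budgets = {"acme": TokenBudget(limit=1000)}
remaining = budgets["acme"].charge(400)
print(remaining)
```

Keying budgets by the same namespace that isolates each corpus keeps cost accounting and data isolation on one boundary.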
This system runs on private client data, so the repo stays private. The engineering rigour is the same as the open work you can inspect, like coolify-mcp. Happy to walk through the code on a call.
Grounded retrieval over a client's own documents, shipped as an MCP server or a chat surface, badged as yours. This is one of several production AI systems I have built. If you have a RAG ask you can't staff, that is where I slot in.