Notes · Feb 2025 · 2 min read

RAG systems in production: retrieval gone wrong

Why semantic search is necessary but insufficient, and what you're actually debugging when your RAG fails.

RAG (Retrieval-Augmented Generation) looks simple on the whiteboard:

In production, everything breaks.

The retrieval bottleneck

Your embedding model is fast. But it's not smart.

You ask: "How do I fix my car?" Your retrieval returns:

The LLM is good at following instructions, but it can't dig the signal out of the noise.

You have 100 documents. 10 are relevant. 90 are noise.

Your embedding similarity scores:

The score ranges for relevant and irrelevant docs overlap, so your top_k=5 misses relevant docs.

You increase to top_k=20. Now you add noise. The LLM gets confused trying to synthesize contradictions.
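The top_k trade-off can be made concrete by measuring precision and recall at each cutoff. The sketch below is only an illustration: the corpus is synthetic (100 docs, 10 relevant), the score ranges are made up to overlap, and `simulate_ranking` and `precision_recall_at_k` are hypothetical helper names, not part of any library.

```python
import random


def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for a ranked list of document ids."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k, hits / len(relevant_ids)


def simulate_ranking(seed=0):
    """Build a synthetic ranking: 100 docs, ids 0-9 are the relevant ones.

    The score ranges are invented for illustration and deliberately overlap,
    so some noise documents outrank some relevant ones.
    """
    rng = random.Random(seed)
    relevant = set(range(10))
    scores = {}
    for doc_id in range(100):
        if doc_id in relevant:
            scores[doc_id] = rng.uniform(0.60, 0.90)  # assumed range
        else:
            scores[doc_id] = rng.uniform(0.40, 0.80)  # assumed range
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, relevant


ranked, relevant = simulate_ranking()
for k in (5, 20):
    precision, recall = precision_recall_at_k(ranked, relevant, k)
    print(f"top_k={k}: precision={precision:.2f} recall={recall:.2f}")
```

With 10 relevant docs, recall at top_k=5 can never exceed 0.5, and raising top_k to 20 can only buy recall by letting at least 10 noise docs into the context.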

The LLM gets context that contradicts its training data. Now what?

Your docs say: "The capital of France is London."
The LLM knows: "The capital of France is Paris."

Does it:

There's no clean solution. You're stuck.

Retrieval adds latency:

Total: 800ms

Your SLA was 200ms. You're now at 4x your budget.
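To see where the time goes, it helps to time each stage separately and compare the sum against the budget. This is a minimal sketch, not a production tracer: `LatencyBudget` is a hypothetical class, and the stage names in the usage are placeholders for whatever your pipeline actually runs.

```python
import time
from contextlib import contextmanager


class LatencyBudget:
    """Accumulate per-stage wall-clock time and compare it to a budget."""

    def __init__(self, budget_ms, clock=time.perf_counter):
        self.budget_ms = budget_ms
        self._clock = clock  # injectable so the timer can be tested
        self.stages = {}

    @contextmanager
    def stage(self, name):
        """Time the enclosed block and add it to the named stage."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.stages[name] = self.stages.get(name, 0.0) + elapsed_ms

    @property
    def total_ms(self):
        return sum(self.stages.values())

    def ratio(self):
        """Total time as a multiple of the budget (4.0 means 4x)."""
        return self.total_ms / self.budget_ms

    def report(self):
        lines = [f"{name}: {ms:.1f} ms" for name, ms in self.stages.items()]
        lines.append(
            f"total: {self.total_ms:.1f} ms "
            f"({self.ratio():.1f}x of {self.budget_ms} ms budget)"
        )
        return "\n".join(lines)


budget = LatencyBudget(budget_ms=200)
with budget.stage("retrieve"):   # placeholder stage name
    time.sleep(0.001)            # stand-in for the real retrieval call
with budget.stage("generate"):   # placeholder stage name
    time.sleep(0.001)            # stand-in for the real LLM call
print(budget.report())
```

A total of 800ms against a 200ms budget shows up here as a ratio of 4.0, and the per-stage lines tell you which step to cut first.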

Be selective: Filter documents before embedding (date ranges, document types)
Re-rank aggressively: Use a second model to filter top-k results
Cache everything: Cache embeddings, cache chunks, cache LLM outputs
Monitor retrieval quality: Track when your docs don't help the LLM
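The four fixes above can be sketched as one small pipeline. Everything here is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `rerank_score` stands in for a second model such as a cross-encoder, and the `Doc` fields, thresholds, and `stats` counter are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from functools import lru_cache

stats = Counter()  # monitoring: how often retrieval returns nothing useful


@dataclass(frozen=True)
class Doc:
    doc_id: str
    doc_type: str
    year: int
    text: str


def prefilter(docs, doc_type, min_year):
    """Be selective: cheap metadata filter before any similarity search."""
    return [d for d in docs if d.doc_type == doc_type and d.year >= min_year]


@lru_cache(maxsize=4096)
def embed(text):
    """Cache everything: toy 'embedding' (a word set), cached per text."""
    return frozenset(text.lower().split())


def first_pass_score(query, doc):
    """Stand-in for vector similarity: Jaccard overlap of word sets."""
    q, d = embed(query), embed(doc.text)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def rerank_score(query, doc):
    """Stand-in for a second model: fraction of query terms the doc covers."""
    q, d = embed(query), embed(doc.text)
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, docs, doc_type, min_year, top_k=20, keep=3, min_score=0.5):
    """Filter, take a wide first pass, then re-rank and keep only strong hits."""
    candidates = prefilter(docs, doc_type, min_year)
    first_pass = sorted(
        candidates, key=lambda d: first_pass_score(query, d), reverse=True
    )[:top_k]
    reranked = sorted(
        first_pass, key=lambda d: rerank_score(query, d), reverse=True
    )
    kept = [d for d in reranked if rerank_score(query, d) >= min_score][:keep]
    stats["queries"] += 1
    if not kept:
        stats["empty"] += 1  # monitor: retrieval had nothing worth sending
    return kept


corpus = [
    Doc("a", "manual", 2024, "replace the brake pads"),
    Doc("b", "blog", 2024, "replace the brake pads"),
    Doc("c", "manual", 2019, "replace the brake pads"),
    Doc("d", "manual", 2024, "history of the company"),
]
print([d.doc_id for d in retrieve("replace brake pads", corpus, "manual", 2023)])
print(dict(stats), embed.cache_info())
```

The design choice worth noting is the order: the wide first pass is cheap and recall-oriented, and the re-ranker is the only step allowed to drop documents, so the LLM sees a short list instead of the full top_k.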

RAG is powerful. But only when retrieval is good. When retrieval is bad, RAG amplifies the wrong signal.

RAG doesn't fix poor source data. It just makes it faster to hallucinate based on it.

Spend 80% of your effort on data quality. Spend 20% on architecture.
