RAG systems in production: retrieval gone wrong

Why semantic search is necessary but insufficient, and what you're actually debugging when your RAG fails.

RAG (Retrieval-Augmented Generation) looks simple on the whiteboard:

  1. Query comes in
  2. Retrieve relevant docs
  3. Stuff them in prompt
  4. LLM generates answer with context
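The four steps above can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, `generate` is a hypothetical placeholder for the LLM call, and the three documents are made up.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Step 2: rank docs by similarity to the query and keep the top_k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 3: stuff the retrieved docs into the prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Step 4: hypothetical placeholder for the LLM call."""
    return f"[LLM answer for a prompt of {len(prompt)} characters]"

# Made-up corpus for illustration.
docs = [
    "how to fix a flat bicycle tire",
    "car insurance renewal checklist",
    "how to fix a car that will not start",
]
query = "how do I fix my car"
context = retrieve(query, docs, top_k=2)
prompt = build_prompt(query, context)
answer = generate(prompt)
print(prompt)
```

Even in this toy, the bicycle guide lands in the context because it shares "how" and "fix" with the query.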

In production, everything breaks.

The retrieval bottleneck

Your embedding model is fast. But it's not smart.

You ask: "How do I fix my car?" Your retrieval returns:

  • A guide to fixing bicycles
  • A post about car insurance
  • An ad for car repair shops
  • One actual car repair guide buried at rank 7

The LLM is good at following instructions, but it can't dig the signal out of that much noise.

The ranking problem

You have 100 documents. 10 are relevant. 90 are noise.

Your embedding similarity scores:

  • Top 10: scores 0.78 - 0.85
  • Relevant docs: scores 0.75 - 0.81

The ranges overlap, so some noise outscores relevant docs. Your top_k=5 misses relevant docs.

You increase to top_k=20. Now you add noise. The LLM gets confused trying to synthesize contradictions.
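One way to see the trade-off is to compute precision and recall at different values of top_k. The sketch below uses made-up scores loosely shaped like the ranges above (10 relevant docs, 90 noise docs, a handful of noise docs scoring higher than any relevant one); `precision_recall_at_k` is a hypothetical helper, and real numbers would come from a labeled evaluation set.

```python
def precision_recall_at_k(scored, k):
    """scored: list of (similarity, is_relevant) pairs. Returns (precision, recall) at k."""
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    hits = sum(1 for _, is_relevant in top if is_relevant)
    total_relevant = sum(1 for _, is_relevant in scored if is_relevant)
    return hits / k, hits / total_relevant

# Made-up scores: 10 relevant docs, 6 high-scoring noise docs, 84 low-scoring noise docs.
relevant = [0.81, 0.80, 0.79, 0.79, 0.78, 0.78, 0.77, 0.76, 0.76, 0.75]
high_noise = [0.85, 0.84, 0.83, 0.82, 0.815, 0.805]
scored = (
    [(s, True) for s in relevant]
    + [(s, False) for s in high_noise]
    + [(0.60, False)] * 84
)

results = {k: precision_recall_at_k(scored, k) for k in (5, 10, 20)}
for k, (precision, recall) in results.items():
    print(f"top_k={k:>2}  precision={precision:.2f}  recall={recall:.2f}")
```

With these assumed numbers, top_k=5 returns only noise, while top_k=20 recovers every relevant doc at the cost of half the context being noise.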

The hallucination trigger

The LLM gets context that contradicts its training data. Now what?

Your docs say: "The capital of France is London."
The LLM knows: "The capital of France is Paris."

Does it:

  1. Hallucinate? (Yes, sometimes)
  2. Say "The documents say London but training says Paris"? (Verbose, confusing)
  3. Trust the documents? (No, it shouldn't)

There's no clean solution. You're stuck.

The latency wall

Retrieval adds latency:

  1. Embed query: 50ms
  2. Search vector DB: 100ms
  3. Re-rank results: 150ms
  4. Get LLM output: 500ms

Total: 800ms

Your SLA was 200ms. You're now at 4x your budget.
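The arithmetic is worth keeping as a small budget check so you can see which stage dominates. The stage names below are illustrative labels for the four steps listed above; the timings and the SLA are the figures from this section.

```python
# Per-stage latencies from the list above, in milliseconds.
stages_ms = {
    "embed_query": 50,
    "vector_search": 100,
    "rerank": 150,
    "llm_generate": 500,
}
sla_ms = 200

total_ms = sum(stages_ms.values())
ratio = total_ms / sla_ms
retrieval_only_ms = total_ms - stages_ms["llm_generate"]

for name, ms in stages_ms.items():
    print(f"{name:<14} {ms:>4} ms  {ms / total_ms:.0%} of total")
print(f"total {total_ms} ms, {ratio:.1f}x the {sla_ms} ms SLA")
print(f"retrieval stages alone: {retrieval_only_ms} ms")
```

Note that the three retrieval stages add up to 300ms on their own, so the SLA is already missed before the LLM produces a single token.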

What actually works

  1. Be selective: Filter documents on metadata (date ranges, document types) before they reach the similarity search
  2. Re-rank aggressively: Use a second model to filter top-k results
  3. Cache everything: Cache embeddings, cache chunks, cache LLM outputs
  4. Monitor retrieval quality: Track how often the retrieved docs actually help the LLM answer
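A minimal sketch of points 1 and 3, assuming each document carries a type and a publication date. The `Doc` class, `prefilter`, and `search` are hypothetical names, and `embed` is a toy word-set stand-in for a real embedding model, wrapped in `functools.lru_cache` so repeated texts are only embedded once.

```python
from dataclasses import dataclass
from datetime import date
from functools import lru_cache

@dataclass(frozen=True)
class Doc:
    text: str
    doc_type: str
    published: date

@lru_cache(maxsize=10_000)
def embed(text: str) -> frozenset:
    """Toy stand-in for an embedding model; cached so repeat texts cost nothing."""
    return frozenset(text.lower().split())

def overlap(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap, standing in for vector similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def prefilter(docs, doc_type, newer_than):
    """Cheap metadata filter applied before any similarity scoring."""
    return [d for d in docs if d.doc_type == doc_type and d.published >= newer_than]

def search(query, docs, top_k=3):
    """Rank the already-filtered docs by similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: overlap(q, embed(d.text)), reverse=True)[:top_k]

# Made-up corpus for illustration.
docs = [
    Doc("how to fix a car that will not start", "guide", date(2023, 5, 1)),
    Doc("car repair shop discount this week", "ad", date(2024, 1, 10)),
    Doc("how to fix a bicycle chain", "guide", date(2024, 3, 2)),
    Doc("fix a car engine the old way", "guide", date(2009, 7, 15)),
]

candidates = prefilter(docs, doc_type="guide", newer_than=date(2020, 1, 1))
results = search("how do I fix my car", candidates)
results_again = search("how do I fix my car", candidates)  # served from the embedding cache
print([d.text for d in results])
print(embed.cache_info())
```

The filter removes the ad and the outdated guide before any scoring happens, which is cheaper than asking a similarity score to tell them apart.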

RAG is powerful. But only when retrieval is good. When retrieval is bad, RAG amplifies the wrong signal.

The uncomfortable truth

RAG doesn't fix poor source data. It just makes it faster to hallucinate based on it.

Spend 80% of your effort on data quality. Spend 20% on architecture.
