Agentic RAG: Self-Correcting Retrieval That Works in Production
Naive vector search breaks silently. Corrective RAG loops catch the failure and rewrite the query before the user notices.
Standard RAG is a pipeline: embed the query, fetch the top-k chunks, stuff them into a prompt, call the LLM. Simple to build, simple to break.
The failure mode is silent. The retriever returns something — it always returns something — and the LLM does its best with whatever it gets. If the chunks are off-topic, the answer is confidently wrong. If the query was ambiguous, the retriever picks the wrong interpretation and never tells you.
Agentic RAG fixes this by making retrieval a loop rather than a single step. The system can judge the quality of what it retrieved, rewrite the query, and try again before the user ever sees an answer.
The naive RAG failure
Consider a knowledge base about cloud infrastructure. The user asks:
"Why is my function running slow after the cold start?"
A naive retriever might return chunks about:
- Lambda function pricing
- Cold start times for different runtimes
- General AWS performance tuning
The second chunk is relevant. The others dilute the context. The LLM averages across all of them and gives a vague answer about "optimising your Lambda configuration."
The user wanted to know about provisioned concurrency. The answer was in the knowledge base. The retriever just fetched the wrong neighbourhood.
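Part of what makes this failure silent is that top-k retrieval has no way to say "nothing matched". The toy retriever below illustrates that under loud assumptions: it uses a bag-of-words cosine score as a crude stand-in for real embeddings, and the chunk texts are hypothetical. It is a sketch of the behaviour, not of any particular vector store.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector; a crude stand-in for an embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k highest-scoring chunks, however low those scores are."""
    q = bow(query)
    scored = sorted(((cosine(q, bow(c)), c) for c in chunks), reverse=True)
    return scored[:k]


# Hypothetical knowledge-base chunks, loosely based on the example above.
chunks = [
    "Lambda function pricing per request and per GB-second",
    "Cold start times for different runtimes",
    "General AWS performance tuning checklist",
    "Provisioned concurrency keeps function instances warm",
]

# A question the knowledge base cannot answer still gets k chunks back.
results = top_k("How do I rotate my database credentials?", chunks, k=3)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

Nothing in the return value distinguishes a good match from a useless one, which is why the judgement has to come from a separate grading step.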
Corrective RAG: the core loop
      User Query
           │
           ▼
┌─────────────────────┐
│ Retrieve (top-k)    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Relevance Grader    │──── relevant ────▶ Generate ──▶ Answer
└──────────┬──────────┘
           │ not relevant
           ▼
┌─────────────────────┐
│ Query Rewriter      │──── retry ────────▶ Retrieve
└──────────┬──────────┘
           │ still no match
           ▼
┌─────────────────────┐
│ Web Search          │──── fallback ─────▶ Generate ──▶ Answer
└─────────────────────┘

The grader is the key addition. It is a fast LLM call that scores whether the retrieved documents actually address the query.
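Before bringing in a framework, the control flow in the diagram can be sketched as an ordinary function. This is a minimal illustration, not the LangGraph implementation built in the rest of this article: every step is passed in as a callable, and the parameter names (`is_relevant`, `max_rewrites`, and so on) are assumptions made for the example.

```python
from typing import Callable


def corrective_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    web_search: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    max_rewrites: int = 2,
) -> str:
    """Retrieve, grade, retry with a rewritten query, then fall back to web search."""
    query = question
    for attempt in range(max_rewrites + 1):
        # Keep only the chunks the grader accepts for the current query.
        docs = [d for d in retrieve(query) if is_relevant(d, query)]
        if docs:
            # Design choice in this sketch: answer the user's original wording.
            return generate(question, docs)
        if attempt < max_rewrites:
            query = rewrite(query)
    # Every attempt came back empty, so use the fallback source.
    return generate(question, web_search(query))


# Stand-in components, just to exercise the loop.
kb = {"provisioned concurrency cold start": ["Provisioned concurrency keeps instances warm."]}

answer = corrective_rag(
    "Why is my function slow?",
    retrieve=lambda q: kb.get(q, ["Lambda pricing overview"]),
    is_relevant=lambda doc, q: "concurrency" in doc.lower(),
    rewrite=lambda q: "provisioned concurrency cold start",
    web_search=lambda q: [],
    generate=lambda q, docs: f"{q} -> {docs[0]}",
)
print(answer)
```

In this run the first retrieval returns only the pricing chunk, the grader rejects it, and the rewritten query finds the chunk that actually answers the question.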
Building the grader
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class RelevanceScore(BaseModel):
    score: str = Field(description="Relevance score: 'yes' or 'no'")
    reason: str = Field(description="One sentence explanation")

grader_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a document relevance grader.
Score whether a retrieved document addresses the user's question.
Return 'yes' if it contains useful information, 'no' if it does not.
Be strict: partial relevance is not relevance."""),
    ("human", "Document:\n{document}\n\nQuestion: {question}"),
])

llm = ChatAnthropic(model="claude-haiku-4-5-20251001")  # fast, cheap grader
grader = grader_prompt | llm.with_structured_output(RelevanceScore)

# RAGState is the TypedDict defined under "State and routing".
def grade_documents(state: RAGState) -> dict:
    question = state["question"]
    documents = state["documents"]

    # One grader call per chunk; keep only the chunks graded "yes".
    relevant_docs = []
    for doc in documents:
        score = grader.invoke({"document": doc.page_content, "question": question})
        if score.score.strip().lower() == "yes":
            relevant_docs.append(doc)

    # Partial state update: the filtered documents plus an overall verdict.
    return {
        "documents": relevant_docs,
        "retrieval_grade": "relevant" if relevant_docs else "not_relevant",
    }

Use a fast model (Haiku) for grading, since you will call it once per retrieved chunk. Use your best model only for the final generation step.
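Because the grader runs once per chunk, grading chunks one after another makes latency grow with k. One way to limit that is to issue the calls concurrently. The sketch below assumes a hypothetical `grade_one(document, question)` callable that wraps a single grader call and returns True for relevant chunks; the function and parameter names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def grade_in_parallel(
    documents: list[str],
    question: str,
    grade_one: Callable[[str, str], bool],
    max_workers: int = 5,
) -> list[str]:
    """Grade every chunk concurrently and keep the relevant ones in their original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so verdicts line up with documents.
        verdicts = list(pool.map(lambda doc: grade_one(doc, question), documents))
    return [doc for doc, keep in zip(documents, verdicts) if keep]


# Stand-in grader: a keyword check instead of an LLM call.
docs = ["Cold start times for different runtimes", "Lambda function pricing"]
kept = grade_in_parallel(
    docs,
    "cold start latency",
    lambda doc, q: "cold start" in doc.lower(),
)
print(kept)
```

With a LangChain runnable such as the grader above, passing a list of inputs to its `batch` method should achieve a similar effect without a hand-written thread pool.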
Query rewriting
When the grader finds no relevant documents, the query rewriter tries to rephrase the question so the retriever has a better chance.
rewriter_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a query optimiser for a retrieval system.
Rewrite the question to be more specific and likely to match relevant documents.
Focus on the core technical concept. Remove ambiguity.
Return only the rewritten question, nothing else."""),
    ("human", "Original question: {question}"),
])

rewriter = rewriter_prompt | ChatAnthropic(model="claude-haiku-4-5-20251001")

def rewrite_query(state: RAGState) -> dict:
    question = state["question"]
    rewritten = rewriter.invoke({"question": question})
    # Overwrites the question in state and counts the attempt so routing can cap retries.
    return {
        "question": rewritten.content.strip(),
        "rewrite_count": state.get("rewrite_count", 0) + 1,
    }

Web search fallback
If rewriting still produces no relevant documents, fall back to web search. This covers queries that the knowledge base simply does not contain.
from langchain_community.tools import TavilySearchResults
from langchain_core.documents import Document

web_search_tool = TavilySearchResults(max_results=3)

def web_search(state: RAGState) -> dict:
    results = web_search_tool.invoke({"query": state["question"]})
    # Wrap each result in a Document so the generate step can treat it like a retrieved chunk.
    web_docs = [Document(page_content=r["content"]) for r in results]
    return {"documents": web_docs, "used_web_search": True}

State and routing
from typing import Literal, TypedDict
from langchain_core.documents import Document

class RAGState(TypedDict):
    question: str
    documents: list[Document]
    generation: str
    retrieval_grade: str
    rewrite_count: int
    used_web_search: bool

def route_after_grading(state: RAGState) -> Literal["generate", "rewrite_query", "web_search"]:
    if state["retrieval_grade"] == "relevant":
        return "generate"
    if state.get("rewrite_count", 0) >= 2:
        return "web_search"  # give up on rewriting, try web
    return "rewrite_query"

def route_after_rewrite(state: RAGState) -> Literal["retrieve", "web_search"]:
    # rewrite_query has already incremented rewrite_count when this runs,
    # so the second rewrite goes straight to web search.
    if state.get("rewrite_count", 0) >= 2:
        return "web_search"
    return "retrieve"

Assembling the graph
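The graph in this section registers two node functions, retrieve_documents and generate_answer, that the article does not show. A minimal sketch of both follows, under stated assumptions: `retriever` is any object whose `invoke(question)` returns documents with a `page_content` attribute, `llm` is any chat model whose `invoke(prompt)` returns a message with a `content` attribute, and the factory names and prompt wording are invented for this example. The stand-in objects at the bottom exist only to exercise the code.

```python
from types import SimpleNamespace
from typing import Any, Callable


def make_retrieve_node(retriever: Any) -> Callable[[dict], dict]:
    """Build a node that fetches documents for the current question."""
    def retrieve_documents(state: dict) -> dict:
        return {"documents": retriever.invoke(state["question"])}
    return retrieve_documents


def make_generate_node(llm: Any) -> Callable[[dict], dict]:
    """Build a node that answers the question from the documents in state."""
    def generate_answer(state: dict) -> dict:
        context = "\n\n".join(d.page_content for d in state["documents"])
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {state['question']}"
        )
        return {"generation": llm.invoke(prompt).content}
    return generate_answer


# Stand-ins for a real retriever and chat model.
fake_retriever = SimpleNamespace(
    invoke=lambda q: [SimpleNamespace(page_content="Provisioned concurrency keeps instances warm.")]
)
fake_llm = SimpleNamespace(invoke=lambda prompt: SimpleNamespace(content="stub answer"))

retrieve_documents = make_retrieve_node(fake_retriever)
generate_answer = make_generate_node(fake_llm)

state = {"question": "Why is my function slow after a cold start?"}
state.update(retrieve_documents(state))
state.update(generate_answer(state))
print(state["generation"])
```

In a real deployment the stand-ins would be replaced by, for example, a vector store's retriever and the stronger model reserved for generation.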
from langgraph.graph import StateGraph, END

builder = StateGraph(RAGState)

# retrieve_documents and generate_answer are the retrieval and generation
# node functions; define them before building the graph.
builder.add_node("retrieve", retrieve_documents)
builder.add_node("grade_documents", grade_documents)
builder.add_node("rewrite_query", rewrite_query)
builder.add_node("web_search", web_search)
builder.add_node("generate", generate_answer)

builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "grade_documents")

# Each mapping translates a router's return value into the next node.
builder.add_conditional_edges("grade_documents", route_after_grading, {
    "generate": "generate",
    "rewrite_query": "rewrite_query",
    "web_search": "web_search",
})
builder.add_conditional_edges("rewrite_query", route_after_rewrite, {
    "retrieve": "retrieve",
    "web_search": "web_search",
})
builder.add_edge("web_search", "generate")
builder.add_edge("generate", END)

graph = builder.compile()

Hallucination and answer grading
One more grader that matters: check whether the generated answer is actually supported by the retrieved documents.
class HallucinationScore(BaseModel):
    grounded: bool = Field(description="True if the answer is supported by the documents")

hallucination_prompt = ChatPromptTemplate.from_messages([
    ("system", "Determine if the answer is fully supported by the provided documents. Return true only if every factual claim in the answer appears in the documents."),
    ("human", "Documents:\n{documents}\n\nAnswer: {generation}"),
])

# Reuses the fast grader model (llm) defined earlier.
hallucination_grader = hallucination_prompt | llm.with_structured_output(HallucinationScore)

def check_hallucination(state: RAGState) -> Literal["useful", "not_grounded"]:
    # Join all retrieved chunks into one context block for the grader.
    score = hallucination_grader.invoke({
        "documents": "\n\n".join(d.page_content for d in state["documents"]),
        "generation": state["generation"],
    })
    return "useful" if score.grounded else "not_grounded"

Add this as a conditional edge after generate, in place of the direct edge from generate to END. If the answer is not grounded, loop back to retrieve with the original question. Because rewrite_query overwrites state["question"], keep the original in a separate state field if you want to retry with it.
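A loop from generate back to retrieve needs a stopping condition, or an answer that never grades as grounded will cycle indefinitely. The sketch below wraps a hallucination check in a router with a retry cap. It assumes a hypothetical `generation_count` field that the generate node would increment on each run; that field, the `give_up` label, and the cap of 2 are illustrative choices, not part of the graph above.

```python
from typing import Callable


def make_generation_router(
    check: Callable[[dict], str],
    max_generations: int = 2,
) -> Callable[[dict], str]:
    """Wrap a hallucination check so the generate-to-retrieve loop is bounded."""
    def route(state: dict) -> str:
        if check(state) == "useful":
            return "useful"
        # generation_count is a hypothetical counter maintained by the generate node.
        if state.get("generation_count", 0) >= max_generations:
            return "give_up"
        return "not_grounded"
    return route


# Wiring sketch (needs langgraph; takes the place of builder.add_edge("generate", END)):
# builder.add_conditional_edges("generate", make_generation_router(check_hallucination), {
#     "useful": END,
#     "not_grounded": "retrieve",
#     "give_up": END,
# })

# Stand-in check that always fails, to show the cap taking effect.
route = make_generation_router(lambda state: "not_grounded")
first_try = route({"generation_count": 1})
capped = route({"generation_count": 2})
grounded = make_generation_router(lambda state: "useful")({"generation_count": 5})
print(first_try, capped, grounded)
```

What to do on `give_up` is a product decision: returning the last answer with a warning, or declining to answer, are both reasonable.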
The cost of self-correction
Every grader call is a token spend. In practice:
- Grading 5 documents costs ~500 tokens with a fast model
- One query rewrite costs ~200 tokens
- One hallucination check costs ~600 tokens
For a query that goes through two retrieval attempts and one hallucination check, that is roughly 1,800 extra tokens (two grading passes, one rewrite, one check); a third retrieval attempt brings it to about 2,500. At current pricing this is fractions of a cent, which is worth it to avoid a confidently wrong answer.
What is not worth it: running an expensive model as the grader. Use Haiku or a local model for grading. Save your budget for the generation step.
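The per-call estimates listed above can be turned into a quick budget check. The helper below is a back-of-the-envelope sketch: it takes the three token figures from the list as given and assumes one rewrite between each pair of consecutive retrieval attempts.

```python
# Rough per-call token estimates taken from the list above.
GRADE_5_DOCS = 500
REWRITE = 200
HALLUCINATION_CHECK = 600


def extra_tokens(retrieval_attempts: int, hallucination_checks: int = 1) -> int:
    """Estimate the tokens spent on self-correction for one query."""
    # Each retry after the first attempt is preceded by one rewrite.
    rewrites = max(retrieval_attempts - 1, 0)
    return (
        retrieval_attempts * GRADE_5_DOCS
        + rewrites * REWRITE
        + hallucination_checks * HALLUCINATION_CHECK
    )


one_attempt = extra_tokens(1)     # 500 + 0 + 600
two_attempts = extra_tokens(2)    # 1000 + 200 + 600
three_attempts = extra_tokens(3)  # 1500 + 400 + 600
print(one_attempt, two_attempts, three_attempts)
```

Even the three-attempt path stays in the low thousands of tokens, so the model chosen for grading affects cost far more than the number of loops.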
When to use agentic RAG
Use it when:
- Your knowledge base covers narrow, technical domains where retrieval precision matters
- Wrong answers have real consequences (support tickets, medical information, financial data)
- Users ask ambiguous or multi-part questions that a single vector query handles poorly
Skip it when:
- Your queries are simple and well-formed
- Latency matters more than accuracy (each grader call adds 200–400ms)
- Your knowledge base is small enough that top-k recall is near 100%
The corrective loop adds complexity. Add it because your retrieval is failing, not because the architecture looks impressive.