Agentic RAG: Self-Correcting Retrieval That Works in Production

Standard RAG is a pipeline: embed the query, fetch the top-k chunks, stuff them into a prompt, call the LLM. Simple to build, simple to break.

The failure mode is silent. The retriever returns something — it always returns something — and the LLM does its best with whatever it gets. If the chunks are off-topic, the answer is confidently wrong. If the query was ambiguous, the retriever picks the wrong interpretation and never tells you.
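
The "always returns something" behaviour is easy to reproduce. The sketch below is a toy top-k retriever over bag-of-words vectors; the corpus, the `embed` function, and the scoring are illustrative stand-ins, not a real embedding model. It returns k chunks even when none of them shares a single word with the query.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Rank every chunk by similarity and return the best k, however poor they are."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

chunks = [
    "lambda function pricing per request",
    "cold start times for different runtimes",
    "general performance tuning guide",
]

on_topic = top_k("cold start times", chunks)             # best match scores well above zero
off_topic = top_k("quarterly revenue forecast", chunks)  # still two chunks, all scoring 0.0
```

A similarity floor would catch the zero-score case, but a chunk can score well and still not answer the question, which is the gap a grader is meant to close.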

Agentic RAG fixes this by making retrieval a loop rather than a single step. The system can judge the quality of what it retrieved, rewrite the query, and try again before the user ever sees an answer.

The naive RAG failure

Consider a knowledge base about cloud infrastructure. The user asks:

"Why is my function running slow after the cold start?"

A naive retriever might return chunks about:

Lambda function pricing
Cold start times for different runtimes
General AWS performance tuning

The second chunk is relevant. The others dilute the context. The LLM averages across all of them and gives a vague answer about "optimising your Lambda configuration."

The user wanted to know about provisioned concurrency. The answer was in the knowledge base. The retriever just fetched the wrong neighbourhood.

Corrective RAG: the core loop

  User Query
      │
      ▼
┌─────────────────────┐
│   Retrieve (top-k)  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Relevance Grader   │──── relevant ────▶  Generate  ──▶  Answer
└──────────┬──────────┘
           │ not relevant
           ▼
┌─────────────────────┐
│   Query Rewriter    │──── retry ────────▶  Retrieve
└──────────┬──────────┘
           │ still no match
           ▼
┌─────────────────────┐
│     Web Search      │──── fallback ─────▶  Generate  ──▶  Answer
└─────────────────────┘

The grader is the key addition. It is a fast LLM call, made once per retrieved document, that scores whether the document actually addresses the query.
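
Stripped of any framework, the diagram reduces to a short loop. The sketch below assumes four caller-supplied functions (`retrieve`, `grade`, `rewrite`, `web_search`, all hypothetical stand-ins for the components built in the following sections) and an assumed cap on rewrites. It returns the documents that would be handed to generation, plus where they came from.

```python
from typing import Callable

def corrective_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], bool],       # (document, question) -> relevant?
    rewrite: Callable[[str], str],
    web_search: Callable[[str], list[str]],
    max_rewrites: int = 1,                    # assumed cap; tune for your latency budget
) -> tuple[list[str], str]:
    """Return (documents, source), where source is 'kb' or 'web'."""
    for attempt in range(max_rewrites + 1):
        # Keep only the chunks the grader accepts.
        docs = [d for d in retrieve(question) if grade(d, question)]
        if docs:
            return docs, "kb"
        if attempt < max_rewrites:
            question = rewrite(question)      # rephrase and try the knowledge base again
    return web_search(question), "web"        # knowledge base exhausted: fall back

# Stub components, for illustration only.
kb = {"provisioned concurrency": ["Provisioned concurrency keeps instances warm."]}
docs, source = corrective_retrieve(
    "why slow after cold start",
    retrieve=lambda q: kb.get(q, ["Lambda pricing overview"]),
    grade=lambda d, q: "concurrency" in d,
    rewrite=lambda q: "provisioned concurrency",
    web_search=lambda q: ["web result"],
)
```

In the stubbed run, the first retrieval is rejected by the grader, the rewritten query hits the right entry, and the web fallback is never called.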

Building the grader

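from typing import Literal

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class RelevanceScore(BaseModel):
    # Literal constrains the structured output to exactly these two values,
    # so the == "yes" check below cannot be defeated by "Yes" or "relevant".
    score: Literal["yes", "no"] = Field(description="Relevance score: 'yes' or 'no'")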
    reason: str = Field(description="One sentence explanation")
 
grader_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a document relevance grader.
Score whether a retrieved document addresses the user's question.
Return 'yes' if it contains useful information, 'no' if it does not.
Be strict — partial relevance is not relevance."""),
    ("human", "Document:\n{document}\n\nQuestion: {question}"),
])
 
llm = ChatAnthropic(model="claude-haiku-4-5-20251001")  # fast, cheap grader
grader = grader_prompt | llm.with_structured_output(RelevanceScore)
 
def grade_documents(state: RAGState) -> RAGState:
    question = state["question"]
    documents = state["documents"]
 
    relevant_docs = []
    for doc in documents:
        score = grader.invoke({"document": doc.page_content, "question": question})
        if score.score == "yes":
            relevant_docs.append(doc)
 
    return {
        "documents": relevant_docs,
        "retrieval_grade": "relevant" if relevant_docs else "not_relevant",
    }

Use a fast model (Haiku) for grading — you will call it once per retrieved chunk. Use your best model only for the final generation step.
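
Because the grader runs once per chunk and the calls do not depend on each other, they can be issued concurrently, which should bring a grading pass closer to the latency of one call than of k. A minimal sketch using a thread pool; `grade_one` is a hypothetical stand-in for a function that wraps `grader.invoke` and returns True for a 'yes' score, and the worker count is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def filter_relevant(
    documents: list[str],
    question: str,
    grade_one: Callable[[str, str], bool],
    max_workers: int = 5,    # assumed; keep within your API rate limits
) -> list[str]:
    """Grade all documents in parallel and keep the relevant ones, in original order."""
    if not documents:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so verdicts line up with documents.
        verdicts = list(pool.map(lambda d: grade_one(d, question), documents))
    return [d for d, keep in zip(documents, verdicts) if keep]

# Stub grader for illustration: "relevant" means the chunk mentions cold starts.
kept = filter_relevant(
    ["Lambda pricing", "Cold start times by runtime", "Reducing cold start latency"],
    "Why is my function slow after a cold start?",
    grade_one=lambda doc, q: "cold start" in doc.lower(),
)
```

LangChain runnables also expose a `batch` method that serves a similar purpose if you prefer to stay inside the framework.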

Query rewriting

When the grader finds no relevant documents, the query rewriter tries to rephrase the question so the retriever has a better chance.

rewriter_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a query optimiser for a retrieval system.
Rewrite the question to be more specific and likely to match relevant documents.
Focus on the core technical concept. Remove ambiguity.
Return only the rewritten question, nothing else."""),
    ("human", "Original question: {question}"),
])
 
rewriter = rewriter_prompt | ChatAnthropic(model="claude-haiku-4-5-20251001")
 
def rewrite_query(state: RAGState) -> RAGState:
    question = state["question"]
    rewritten = rewriter.invoke({"question": question})
    return {
        "question": rewritten.content,
        "rewrite_count": state.get("rewrite_count", 0) + 1,
    }

Web search fallback

If rewriting still produces no relevant documents, fall back to web search. This covers queries that the knowledge base simply does not contain.

from langchain_community.tools import TavilySearchResults
from langchain_core.documents import Document  # needed for the Document(...) calls below
 
web_search_tool = TavilySearchResults(max_results=3)
 
def web_search(state: RAGState) -> RAGState:
    results = web_search_tool.invoke({"query": state["question"]})
    web_docs = [Document(page_content=r["content"]) for r in results]
    return {"documents": web_docs, "used_web_search": True}
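
The list comprehension above assumes every search call succeeds and every result carries a `content` key. A slightly more defensive conversion is sketched below. It assumes results arrive as a list of dicts with `content` and `url` keys, and that a failed search may come back as something other than a list (for example an error string, depending on the tool version). `Doc` is a stand-in for `Document` so the sketch runs without LangChain installed.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def results_to_docs(results) -> list[Doc]:
    """Convert search results to documents, tolerating malformed responses."""
    if not isinstance(results, list):
        return []  # e.g. an error message instead of a result list
    docs = []
    for r in results:
        # Skip entries that are not dicts or that have no usable text.
        if isinstance(r, dict) and r.get("content"):
            # Keep the URL so the final answer can cite where web content came from.
            docs.append(Doc(r["content"], {"source": r.get("url", "unknown")}))
    return docs

web_docs = results_to_docs([
    {"content": "Provisioned concurrency overview", "url": "https://example.com/a"},
    {"url": "https://example.com/b"},  # no content: dropped
])
```

Keeping the source URL in metadata also makes it easier to tell web-sourced context apart from knowledge-base context at generation time.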

State and routing

from typing import Literal, TypedDict
from langchain_core.documents import Document
 
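# Shared graph state. In a single module, define this before the node
# functions, since their type annotations reference RAGState.
class RAGState(TypedDict):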
    question: str
    documents: list[Document]
    generation: str
    retrieval_grade: str
    rewrite_count: int
    used_web_search: bool
 
def route_after_grading(state: RAGState) -> Literal["generate", "rewrite_query", "web_search"]:
    if state["retrieval_grade"] == "relevant":
        return "generate"
    if state.get("rewrite_count", 0) >= 2:
        return "web_search"  # give up on rewriting, try web
    return "rewrite_query"
 
def route_after_rewrite(state: RAGState) -> Literal["retrieve", "web_search"]:
    if state.get("rewrite_count", 0) >= 2:
        return "web_search"
    return "retrieve"
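
The two routers can be exercised without a model by feeding them hand-built state. The sketch below repeats the routing logic on plain dicts and simulates a run in which grading never finds a relevant document, to show the path the graph would take with the cap of 2. The `trace_worst_case` helper is illustrative, not part of the graph.

```python
def route_after_grading(state: dict) -> str:
    if state["retrieval_grade"] == "relevant":
        return "generate"
    if state.get("rewrite_count", 0) >= 2:
        return "web_search"
    return "rewrite_query"

def route_after_rewrite(state: dict) -> str:
    if state.get("rewrite_count", 0) >= 2:
        return "web_search"
    return "retrieve"

def trace_worst_case() -> list[str]:
    """Walk the graph assuming every grading pass finds nothing relevant."""
    state = {"retrieval_grade": "not_relevant", "rewrite_count": 0}
    path, node = [], "retrieve"
    while node not in ("generate", "web_search"):
        path.append(node)
        if node == "retrieve":
            node = "grade_documents"
        elif node == "grade_documents":
            node = route_after_grading(state)
        elif node == "rewrite_query":
            state["rewrite_count"] += 1    # mirrors what rewrite_query does to the state
            node = route_after_rewrite(state)
    path.append(node)
    return path

path = trace_worst_case()
# retrieve, grade_documents, rewrite_query, retrieve, grade_documents, rewrite_query, web_search
```

With these thresholds the second rewrite is never retried against the knowledge base: it goes straight to web search, which still benefits from the rewritten wording. If you want the second rewrite to get its own retrieval attempt, one option is to have route_after_rewrite always return "retrieve" and leave the cap to route_after_grading.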

Assembling the graph

from langgraph.graph import StateGraph, END
 
builder = StateGraph(RAGState)
 
builder.add_node("retrieve", retrieve_documents)
builder.add_node("grade_documents", grade_documents)
builder.add_node("rewrite_query", rewrite_query)
builder.add_node("web_search", web_search)
builder.add_node("generate", generate_answer)
 
builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "grade_documents")
 
builder.add_conditional_edges("grade_documents", route_after_grading, {
    "generate": "generate",
    "rewrite_query": "rewrite_query",
    "web_search": "web_search",
})
 
builder.add_conditional_edges("rewrite_query", route_after_rewrite, {
    "retrieve": "retrieve",
    "web_search": "web_search",
})
 
builder.add_edge("web_search", "generate")
builder.add_edge("generate", END)
 
graph = builder.compile()
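
The graph refers to `retrieve_documents` and `generate_answer`, which are not defined in this article. A minimal sketch of what they might look like, written as factories so the retriever and generation model can be injected. `search` and `complete` are hypothetical callables (for example, thin wrappers around a vector store retriever and your generation model), and the prompt wording is an assumption.

```python
from typing import Callable

def make_retrieve_node(search: Callable[[str], list]) -> Callable[[dict], dict]:
    """Build a graph node that fetches documents for the current question."""
    def retrieve_documents(state: dict) -> dict:
        return {"documents": search(state["question"])}
    return retrieve_documents

def make_generate_node(complete: Callable[[str], str]) -> Callable[[dict], dict]:
    """Build a graph node that answers from the graded documents only."""
    def generate_answer(state: dict) -> dict:
        # Accept either Document-like objects or plain strings.
        context = "\n\n".join(
            getattr(d, "page_content", str(d)) for d in state["documents"]
        )
        prompt = (
            "Answer the question using only the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {state['question']}"
        )
        return {"generation": complete(prompt)}
    return generate_answer

# Stubs for illustration; swap in a real retriever and model.
retrieve_documents = make_retrieve_node(
    lambda q: ["Provisioned concurrency keeps instances warm."]
)
generate_answer = make_generate_node(
    lambda prompt: f"(answer based on {len(prompt)} chars of prompt)"
)

state = {"question": "How do I avoid cold starts?"}
state.update(retrieve_documents(state))
state.update(generate_answer(state))
```

Each node returns only the keys it changes, the same convention the other nodes in this article follow. Once real nodes are registered, the compiled graph can be run with `graph.invoke({"question": "..."})`, which returns the final state.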

Hallucination and answer grading

One more grader that matters: check whether the generated answer is actually supported by the retrieved documents.

class HallucinationScore(BaseModel):
    grounded: bool = Field(description="True if the answer is supported by the documents")
 
hallucination_prompt = ChatPromptTemplate.from_messages([
    ("system", "Determine if the answer is fully supported by the provided documents. Return true only if every factual claim in the answer appears in the documents."),
    ("human", "Documents:\n{documents}\n\nAnswer: {generation}"),
])
 
hallucination_grader = hallucination_prompt | llm.with_structured_output(HallucinationScore)
 
def check_hallucination(state: RAGState) -> Literal["useful", "not_grounded"]:
    score = hallucination_grader.invoke({
        "documents": "\n\n".join(d.page_content for d in state["documents"]),
        "generation": state["generation"],
    })
    return "useful" if score.grounded else "not_grounded"

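Add this as a conditional edge after generate, in place of the unconditional generate → END edge in the graph above. If the answer is not grounded, loop back to retrieve with the original question. Two caveats: rewrite_query overwrites state["question"], so keep the original wording in a separate state field if you want to retry with it, and cap the number of retries so an answer that can never be grounded does not loop forever.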
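
A sketch of that conditional edge with a retry cap. It assumes two extra state fields that are not in `RAGState` above, `original_question` (set once when the run starts) and `retry_count`, plus an injected `is_grounded` check standing in for `hallucination_grader`. The cap value, the `give_up` label, and the `reset_for_retry` node are illustrative.

```python
from typing import Callable

MAX_RETRIES = 1  # assumed cap on extra retrieve-and-generate rounds

def make_grounding_router(
    is_grounded: Callable[[list, str], bool],
) -> Callable[[dict], str]:
    """Build a router for the edge that follows `generate`."""
    def route_after_generate(state: dict) -> str:
        if is_grounded(state["documents"], state["generation"]):
            return "useful"
        if state.get("retry_count", 0) >= MAX_RETRIES:
            return "give_up"        # stop looping; surface a fallback message instead
        return "not_grounded"       # go round again
    return route_after_generate

def reset_for_retry(state: dict) -> dict:
    """State update applied before looping back: restore the user's original wording."""
    return {
        "question": state.get("original_question", state["question"]),
        "retry_count": state.get("retry_count", 0) + 1,
    }

# Stub check for illustration: grounded if the answer text appears in some document.
router = make_grounding_router(lambda docs, answer: any(answer in d for d in docs))

# Possible wiring, replacing builder.add_edge("generate", END):
#   builder.add_node("reset_for_retry", reset_for_retry)
#   builder.add_conditional_edges("generate", router, {
#       "useful": END, "not_grounded": "reset_for_retry", "give_up": END})
#   builder.add_edge("reset_for_retry", "retrieve")
```

Both extra fields would need to be declared on the state class for the graph to carry them between nodes.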

The cost of self-correction

Every grader call is a token spend. In practice:

Grading 5 documents costs ~500 tokens with a fast model
One query rewrite costs ~200 tokens
One hallucination check costs ~600 tokens

For a query that goes through two retrieval attempts and one hallucination check, that is two grading passes, one rewrite, and one check: roughly 1,800 extra tokens. At current pricing this is a fraction of a cent, which is worth it to avoid a confidently wrong answer.
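
The per-step estimates can be summed for any path through the graph. A small sketch using only the three figures listed above (the dictionary and function names are illustrative); real totals will vary with prompt and chunk length.

```python
# Rough per-step token estimates, taken from the list above.
TOKENS = {"grade_pass": 500, "rewrite": 200, "hallucination_check": 600}

def extra_tokens(grade_passes: int, rewrites: int, hallucination_checks: int) -> int:
    """Overhead of the corrective loop on top of plain retrieve-and-generate."""
    return (
        grade_passes * TOKENS["grade_pass"]
        + rewrites * TOKENS["rewrite"]
        + hallucination_checks * TOKENS["hallucination_check"]
    )

# First retrieval is good enough: one grading pass, one grounding check.
happy_path = extra_tokens(grade_passes=1, rewrites=0, hallucination_checks=1)  # 1100

# One rewrite and retry: two grading passes, one rewrite, one grounding check.
one_retry = extra_tokens(grade_passes=2, rewrites=1, hallucination_checks=1)   # 1800
```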

What is not worth it: running an expensive model as the grader. Use Haiku or a local model for grading. Save your budget for the generation step.

When to use agentic RAG

Use it when:

Your knowledge base covers narrow, technical domains where retrieval precision matters
Wrong answers have real consequences (support tickets, medical information, financial data)
Users ask ambiguous or multi-part questions that a single vector query handles poorly

Skip it when:

Your queries are simple and well-formed
Latency matters more than accuracy (each grader call adds 200–400ms)
Your knowledge base is small enough that top-k recall is near 100%

The corrective loop adds complexity. Add it because your retrieval is failing, not because the architecture looks impressive.