Building a Production-Ready LangGraph Agent

State machines, tool nodes, and error recovery — everything the tutorials skip before you ship your first real agent.

Most LangGraph tutorials walk you through a chatbot that calls a search tool. Then they end. What they skip is everything that happens between "it works on my laptop" and "it handles 500 concurrent users without corrupting state."

This post covers the real architecture: typed state, conditional routing, tool node wiring, and error recovery. The kind of stuff you only learn after your first incident.

Why LangGraph over plain LangChain

LangChain gives you chains. Chains are linear. Real agents are not.

A production agent needs to branch, retry, loop back, and hand off to different tools depending on what it finds. LangGraph models this as an explicit state machine — nodes are functions, edges are transitions, and the graph makes the control flow visible and testable.

    User Input
         │
         ▼
┌─────────────────┐
│     Router      │──── no tool needed ────▶  Direct Reply
└────────┬────────┘
         │ tool needed
         ▼
┌─────────────────┐
│   Tool Node     │──── success ───────────▶  Validator ──▶ Response
└────────┬────────┘                               │ invalid
         │ error                                  ▼
         ▼                                   Retry Router
┌─────────────────┐
│    Recovery     │──────────────────────────────────────▶  Response
└─────────────────┘

Defining typed state

The first thing you build is your state schema. This is the single source of truth that every node reads from and writes to.

from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages
 
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    tool_call_count: int
    last_tool_error: str | None
    validated: bool

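add_messages is a reducer: it merges new messages into the existing list (appending them, or replacing an entry whose ID matches) rather than overwriting the list. Reducers are how LangGraph combines several writes to the same key.

Never use a plain list for messages. Without a reducer, each node's return value overwrites the key, so earlier history is lost. If two nodes write to that key in the same step, LangGraph raises an InvalidUpdateError instead of merging the writes.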
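
To see what a reducer buys you, here is a stdlib-only sketch of the merge step. It is a toy model, not LangGraph's implementation: `append_reducer`, `REDUCERS`, and `apply_update` are hypothetical names used only for illustration.

```python
from typing import Any, Callable


def append_reducer(current: list, new: list) -> list:
    """Return a new list with the incoming items appended."""
    return current + new


# Keys listed here are merged; every other key is simply overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"messages": append_reducer}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into a shallow copy of the state."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "notes": ["a"]}
state = apply_update(state, {"messages": ["hello"], "notes": ["b"]})
```

After the update, `messages` holds both entries while `notes`, which has no reducer, keeps only the latest write.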

Building the nodes

Each node is a function that takes state and returns a partial state update.

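from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AIMessage, ToolMessage

# `tools` is your list of tool objects, defined elsewhere.
# Assumes each one is a LangChain tool exposing a .name attribute.
tool_registry = {tool.name: tool for tool in tools}

llm = ChatAnthropic(model="claude-sonnet-4-6")
llm_with_tools = llm.bind_tools(tools)

def router_node(state: AgentState) -> dict:
    # Ask the model for the next step; add_messages appends the reply.
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def tool_node(state: AgentState) -> dict:
    last_message = state["messages"][-1]
    results = []
    last_error = None

    for tool_call in last_message.tool_calls:
        try:
            # The lookup sits inside the try so an unknown tool name is reported too.
            tool = tool_registry[tool_call["name"]]
            result = tool.invoke(tool_call["args"])
            results.append(ToolMessage(content=str(result), tool_call_id=tool_call["id"]))
        except Exception as e:
            # Record the failure so recovery_node can report it.
            last_error = f"{tool_call['name']}: {e}"
            results.append(ToolMessage(
                content=f"ERROR: {e}",
                tool_call_id=tool_call["id"],
            ))

    # Counts visits to this node (one per batch of tool calls), not individual calls.
    count = state.get("tool_call_count", 0) + 1
    return {
        "messages": results,
        "tool_call_count": count,
        "last_tool_error": last_error,
    }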

Conditional edges

This is where LangGraph earns its name. Instead of always going to the same next node, you write a routing function that inspects state and returns which node to visit.

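MAX_TOOL_ROUNDS = 5         # hard ceiling enforced at the router
MAX_VALIDATION_RETRIES = 3  # retries allowed after a failed validation

def should_use_tools(state: AgentState) -> str:
    last = state["messages"][-1]

    # No tool calls requested: answer directly.
    if not getattr(last, "tool_calls", None):
        return "direct_reply"

    if state.get("tool_call_count", 0) >= MAX_TOOL_ROUNDS:
        return "max_retries_exceeded"

    return "tools"

def after_validation(state: AgentState) -> str:
    if state.get("validated"):
        return "respond"
    # With these values this limit is reached first; the router ceiling is a backstop.
    if state.get("tool_call_count", 0) < MAX_VALIDATION_RETRIES:
        return "retry"
    return "respond"  # give up gracefully after 3 attempts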
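
Because a routing function only reads state and returns a string, it can be tested without an LLM or a compiled graph. The sketch below repeats the router so it runs on its own; `fake_message` is a hypothetical stand-in for an AIMessage that carries only the attribute the router reads.

```python
from types import SimpleNamespace

MAX_TOOL_ROUNDS = 5  # same ceiling the router above uses


def should_use_tools(state: dict) -> str:
    last = state["messages"][-1]
    if not getattr(last, "tool_calls", None):
        return "direct_reply"
    if state.get("tool_call_count", 0) >= MAX_TOOL_ROUNDS:
        return "max_retries_exceeded"
    return "tools"


def fake_message(tool_calls=None):
    """Hypothetical test double: exposes only `tool_calls`."""
    return SimpleNamespace(tool_calls=tool_calls or [])


call = {"name": "search", "args": {}, "id": "call-1"}
plain = {"messages": [fake_message()], "tool_call_count": 0}
wants_tool = {"messages": [fake_message([call])], "tool_call_count": 1}
exhausted = {"messages": [fake_message([call])], "tool_call_count": 5}

routes = [should_use_tools(s) for s in (plain, wants_tool, exhausted)]
```

One state per branch is usually enough to catch a routing regression before it reaches the graph.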

Assembling the graph

from langgraph.graph import StateGraph, END

# validator_node, direct_reply_node, and recovery_node must be defined before this runs.
builder = StateGraph(AgentState)
 
builder.add_node("router", router_node)
builder.add_node("tools", tool_node)
builder.add_node("validator", validator_node)
builder.add_node("recovery", recovery_node)
builder.add_node("direct_reply", direct_reply_node)
 
builder.set_entry_point("router")
 
builder.add_conditional_edges("router", should_use_tools, {
    "tools": "tools",
    "direct_reply": "direct_reply",
    "max_retries_exceeded": "recovery",
})
 
builder.add_edge("tools", "validator")
 
builder.add_conditional_edges("validator", after_validation, {
    "respond": END,
    "retry": "router",
})
 
builder.add_edge("direct_reply", END)
builder.add_edge("recovery", END)
 
graph = builder.compile()
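
The graph registers a validator node without showing its body. As one possible shape, and purely as an assumption about what "valid" means here, the sketch below treats a run as validated when the latest message is not a tool error. `ERROR_PREFIX` and this `validator_node` body are illustrative, and the messages are duck-typed so the example runs with the standard library alone.

```python
from types import SimpleNamespace

ERROR_PREFIX = "ERROR:"  # assumed to match the prefix tool_node writes on failure


def validator_node(state: dict) -> dict:
    """Return a partial update marking whether the latest message looks usable."""
    messages = state.get("messages", [])
    if not messages:
        return {"validated": False}
    content = getattr(messages[-1], "content", "")
    failed = isinstance(content, str) and content.startswith(ERROR_PREFIX)
    return {"validated": not failed}


ok = validator_node({"messages": [SimpleNamespace(content="42 results")]})
bad = validator_node({"messages": [SimpleNamespace(content="ERROR: timeout")]})
```

This version checks only the last message. A step that ran several tool calls would need to inspect every tool result from that step, and a real invariant (schema checks, non-empty results) would replace the prefix test.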

Error recovery node

The recovery node is what separates production agents from demos. When the tool chain fails or the LLM hallucinates a bad tool call three times in a row, you need a graceful exit — not a 500.

def recovery_node(state: AgentState) -> dict:
    # The key exists with value None on a clean run, so a .get() default
    # would never apply; fall back with `or` instead.
    error_context = state.get("last_tool_error") or "unknown error"
    recovery_message = AIMessage(
        content=(
            "I ran into a problem completing that request. "
            f"The underlying issue was: {error_context}. "
            "Here is what I know so far, and what you could try manually..."
        )
    )
    return {"messages": [recovery_message], "validated": True}

Streaming responses

For production UI, you almost always want streaming. LangGraph supports it natively.

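from langchain_core.messages import HumanMessage

async def run_agent(user_input: str):
    initial_state = {
        "messages": [HumanMessage(content=user_input)],
        "tool_call_count": 0,
        "last_tool_error": None,
        "validated": False,
    }

    # stream_mode="updates" yields one chunk per node as it finishes (whole
    # messages, not tokens). Use stream_mode="messages" for token-level streaming.
    async for chunk in graph.astream(initial_state, stream_mode="updates"):
        for node_name, node_output in chunk.items():
            # Skip nodes that returned nothing or did not touch messages.
            if not node_output or "messages" not in node_output:
                continue
            for msg in node_output["messages"]:
                if hasattr(msg, "content") and msg.content:
                    yield msg.content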

What actually goes wrong in production

Tool schema drift. You change a tool's arguments but leave the old signature in the system prompt or few-shot examples. The model keeps calling the old signature, and nothing catches it until a user reports it.

Fix: version your tool schemas. Treat them like API contracts.
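
One way to treat a schema like a contract is to pin a fingerprint of it and fail at startup when it changes. This is a minimal sketch under stated assumptions: the schemas are plain JSON-serializable dicts, and `schema_fingerprint`, `check_contract`, and the `search` schemas are hypothetical names, not part of LangChain or LangGraph.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable short hash of a JSON-serializable tool schema."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def check_contract(name: str, schema: dict, pinned: dict[str, str]) -> None:
    """Raise if a tool's schema no longer matches its pinned fingerprint."""
    actual = schema_fingerprint(schema)
    expected = pinned.get(name)
    if actual != expected:
        raise RuntimeError(
            f"Tool {name!r} schema changed: expected {expected}, got {actual}. "
            "Update prompts and examples, then re-pin."
        )


search_v1 = {"name": "search", "parameters": {"query": {"type": "string"}}}
search_v2 = {"name": "search", "parameters": {"q": {"type": "string"}}}
pinned = {"search": schema_fingerprint(search_v1)}

check_contract("search", search_v1, pinned)  # unchanged schema: no error
try:
    check_contract("search", search_v2, pinned)
    drift_detected = False
except RuntimeError:
    drift_detected = True
```

Running the check in CI or at process start turns a silent mismatch into a failed deploy, which is the point at which prompts and examples can be updated together with the tool.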

Unbounded loops. A routing function always returns "retry". Your graph loops until you hit a timeout or run out of tokens.

Fix: always track tool_call_count in state and add a hard ceiling in every routing function.
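
To avoid repeating the ceiling check by hand in every router, it can live in a small decorator. The sketch below is one way to do that; `with_ceiling` and `always_retry` are hypothetical names. LangGraph's `recursion_limit` config key is a separate, coarser backstop that aborts the run with an error rather than routing to a graceful exit.

```python
from functools import wraps
from typing import Callable


def with_ceiling(limit: int, fallback: str, counter_key: str = "tool_call_count"):
    """Wrap a routing function so it returns `fallback` once the counter reaches `limit`."""
    def decorator(route: Callable[[dict], str]) -> Callable[[dict], str]:
        @wraps(route)
        def guarded(state: dict) -> str:
            if state.get(counter_key, 0) >= limit:
                return fallback
            return route(state)
        return guarded
    return decorator


@with_ceiling(limit=3, fallback="respond")
def always_retry(state: dict) -> str:
    # Deliberately buggy router: it never stops on its own.
    return "retry"


decisions = [always_retry({"tool_call_count": n}) for n in range(5)]
```

Even the broken router terminates here, because the guard runs before the routing logic gets a say.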

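State mutation side effects. A node mutates a list in state directly rather than returning a new value. Updates only reach a key through the node's return value, so an in-place change bypasses the reducer: it may never be recorded, or it may leak into other nodes through a shared reference.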

Fix: always return new objects from nodes, never mutate in place.
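
In-place mutation is easy to catch in unit tests. The helper below is a sketch, not a LangGraph utility: `assert_does_not_mutate`, `good_node`, and `bad_node` are hypothetical names, and the comparison assumes state values support `==` (plain data does).

```python
import copy
from typing import Callable


def assert_does_not_mutate(node: Callable[[dict], dict], state: dict) -> dict:
    """Run `node` on `state` and fail if the input was changed in place."""
    snapshot = copy.deepcopy(state)
    update = node(state)
    if state != snapshot:
        raise AssertionError(f"{node.__name__} mutated its input state")
    return update


def good_node(state: dict) -> dict:
    # Builds a new list; the input is left untouched.
    return {"notes": state["notes"] + ["checked"]}


def bad_node(state: dict) -> dict:
    # Appends in place, then returns the same list object.
    state["notes"].append("checked")
    return {"notes": state["notes"]}


update = assert_does_not_mutate(good_node, {"notes": ["start"]})
try:
    assert_does_not_mutate(bad_node, {"notes": ["start"]})
    caught = False
except AssertionError:
    caught = True
```

Both nodes return the same update, which is why this bug passes casual inspection; only the snapshot comparison tells them apart.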

Missing checkpointing. A 30-step agentic run fails on step 28. You restart from scratch.

Fix: add a checkpointer from the start. The cost is one small write per step, and an interrupted run can resume from its last checkpoint instead of from zero.

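import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite

# In recent releases from_conn_string() is a context manager, so build the
# saver from a connection directly. Async runs (astream) need AsyncSqliteSaver.
conn = sqlite3.connect("agent_runs.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)
graph = builder.compile(checkpointer=checkpointer)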

Running it

# initial_state is the same dict that run_agent builds in the streaming section.
config = {"configurable": {"thread_id": "user-session-42"}}
result = graph.invoke(initial_state, config=config)
print(result["messages"][-1].content)

The thread_id is what ties checkpoints to a specific user session. Use your session ID or user ID here — not a random UUID — so you can resume interrupted runs deterministically.

The pattern that holds

Start with the simplest graph that could work: router → tools → END. Add error recovery only when you have a real failure case to recover from. Add validation only when you have a real invariant to enforce. Every node you add is a node that can fail — keep the graph as small as the problem demands.

The tutorials show you how to build a graph. Production teaches you which nodes to leave out.
