Tool-Use Reliability: Four Failure Modes and How to Handle Them
Hallucinated arguments, wrong tool selection, timeouts, schema drift — a field guide to keeping function-calling agents honest.
Function calling works beautifully in demos. The LLM sees the tool schemas, picks the right one, generates valid arguments, and the result comes back clean.
Production is different. Users ask questions the demo did not cover. The LLM picks the wrong tool. It generates arguments that fail validation. The external API times out. The schema you shipped last quarter no longer matches the interface the model was told about.
These are not edge cases. They are the default state of a tool-calling agent under real load.
The four failure modes
1. Hallucinated arguments
The LLM calls the right tool but invents argument values that do not exist.
# Tool schema
def get_user(user_id: str) -> dict:
    """Fetch user details by ID."""
    ...

# LLM generates this call
{
    "name": "get_user",
    "args": {"user_id": "user_abc_123"}  # this user does not exist
}

The tool call succeeds (no exception) but returns an empty result. The LLM then confidently tells the user "I couldn't find any information about that user", which may or may not be true.
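One way to keep an empty result from being mistaken for a real answer is to have the tool report "not found" explicitly. The sketch below is a hypothetical in-memory version of get_user; the USERS store and the shape of the returned dict are assumptions made for illustration, not part of any real API.

```python
# Hypothetical in-memory user store, standing in for a real database.
USERS = {"usr_0000001": {"id": "usr_0000001", "name": "Ada"}}


def get_user(user_id: str) -> dict:
    """Fetch user details by ID, reporting a missing user explicitly."""
    user = USERS.get(user_id)
    if user is None:
        # An explicit status lets the LLM tell "no such ID" apart from
        # "the user exists but has no data".
        return {"found": False, "error": f"No user with id {user_id!r}"}
    return {"found": True, "user": user}
```

With a result like this, the model has something concrete to react to: it can ask the user to confirm the ID instead of reporting an empty answer as fact.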
2. Wrong tool selection
The LLM picks a tool that is semantically adjacent but functionally wrong.
A user asks: "What orders did customer 42 place this month?"
Available tools: get_customer_profile, search_orders, list_all_orders
The LLM picks get_customer_profile. It gets customer metadata but no orders. The answer is wrong, and the LLM does not know it is wrong.
3. Timeout and network failure
External APIs fail. Your agent needs a plan for when they do.
# What happens without handling
tool_result = external_api.call(args)  # hangs for 30 seconds, then raises
# The LLM never gets a result
# The graph node raises an unhandled exception
# The user sees a 500

4. Schema drift
You update a tool's interface. The model's system prompt still describes the old interface. The model generates calls that no longer work.
This is the silent killer. It happens gradually, usually after a refactor, and surfaces as intermittent failures that are hard to reproduce.
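Drift of this kind can often be caught mechanically before it reaches users. The sketch below compares the parameter names documented for the model against the real function signatures, so it could run as a test in CI. DOCUMENTED_PARAMS and find_schema_drift are hypothetical names, and the stub tools only stand in for real implementations.

```python
import inspect


def get_user(user_id: str) -> dict:
    ...


# Stub whose signature has drifted: the prompt still documents `end_date`,
# but the implementation now takes `days` instead.
def search_orders(customer_id: str, start_date: str, days: int) -> list:
    ...


TOOLS = {"get_user": get_user, "search_orders": search_orders}

# Parameter names exactly as the system prompt describes them (assumed).
DOCUMENTED_PARAMS = {
    "get_user": ["user_id"],
    "search_orders": ["customer_id", "start_date", "end_date"],
}


def find_schema_drift(tools: dict, documented: dict) -> dict:
    """Return {tool_name: (documented_params, actual_params)} for mismatches."""
    drift = {}
    for name, fn in tools.items():
        actual = list(inspect.signature(fn).parameters)
        if actual != documented.get(name):
            drift[name] = (documented.get(name), actual)
    return drift


drift = find_schema_drift(TOOLS, DOCUMENTED_PARAMS)
print(drift)
```

The check only compares parameter names and order. Comparing types or return shapes would need a richer description of each tool.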
Handling hallucinated arguments: validate before executing
Add a validation layer between the LLM's tool call decision and the actual tool execution.
from langchain_core.messages import ToolMessage
from pydantic import BaseModel, ValidationError, field_validator

class GetUserArgs(BaseModel):
    user_id: str

    @field_validator("user_id")
    @classmethod
    def user_id_must_be_valid_format(cls, v: str) -> str:
        if not v.startswith("usr_") or len(v) < 10:
            raise ValueError(f"Invalid user_id format: {v}")
        return v

# AgentState, TOOL_VALIDATORS (tool name -> args model, e.g. {"get_user": GetUserArgs}),
# and execute_tool are assumed to be defined elsewhere in the agent.
def validated_tool_node(state: AgentState) -> AgentState:
    tool_call = state["messages"][-1].tool_calls[0]
    tool_name = tool_call["name"]
    args = tool_call["args"]

    # Validate args before calling the tool
    validator = TOOL_VALIDATORS.get(tool_name)
    if validator:
        try:
            # Execute with the validated (and possibly coerced) values
            args = validator(**args).model_dump()
        except ValidationError as e:
            error_msg = ToolMessage(
                content=f"Argument validation failed: {e}",
                tool_call_id=tool_call["id"],
            )
            return {"messages": [error_msg], "last_tool_error": str(e)}

    # Only execute if validation passed
    result = execute_tool(tool_name, args)
    return {"messages": [ToolMessage(content=str(result), tool_call_id=tool_call["id"])]}

Handling wrong tool selection: tool descriptions matter more than code
The single highest-leverage fix for wrong tool selection is better tool descriptions. The LLM reads these at inference time.
from langchain_core.tools import tool

# Bad: too vague
@tool
def search_orders(query: str) -> list:
    """Search orders."""
    ...

# Good: tells the LLM exactly when to use this
@tool
def search_orders(
    customer_id: str,
    start_date: str,
    end_date: str,
) -> list:
    """
    Search orders for a specific customer within a date range.
    Use this tool when the user asks about orders, purchases, or transactions
    for a named customer or customer ID. Requires a date range.
    Do NOT use this for browsing all orders; use list_all_orders instead.
    Do NOT use this for customer profile information; use get_customer_profile instead.
    """
    ...

The "Do NOT use this for..." pattern explicitly rules out the adjacent tools. It is tedious to write, and it reliably improves selection accuracy.
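Description changes are easier to judge when selection accuracy is measured rather than eyeballed. The following is a minimal sketch of such a check. It assumes a hypothetical select_tool callable that wraps your model and returns the name of the tool it picked; the keyword-based fake_selector only stands in for the LLM so the example runs on its own, and the labeled queries are illustrative.

```python
from typing import Callable

# Labeled queries: (user query, tool the agent should pick).
CASES = [
    ("What orders did customer 42 place this month?", "search_orders"),
    ("Show me every order in the system", "list_all_orders"),
    ("What is customer 42's email address?", "get_customer_profile"),
]


def selection_accuracy(select_tool: Callable[[str], str], cases) -> tuple[float, list]:
    """Return (accuracy, misses), where each miss is (query, expected, got)."""
    misses = []
    for query, expected in cases:
        got = select_tool(query)
        if got != expected:
            misses.append((query, expected, got))
    return 1 - len(misses) / len(cases), misses


def fake_selector(query: str) -> str:
    """Stand-in for the LLM: a naive rule that confuses two adjacent tools."""
    if "customer" in query.lower():
        return "get_customer_profile"
    return "list_all_orders"


accuracy, misses = selection_accuracy(fake_selector, CASES)
print(f"accuracy={accuracy:.2f}")
for query, expected, got in misses:
    print(f"MISS: {query!r} expected {expected}, got {got}")
```

Running the same cases before and after a description rewrite shows whether the rewrite helped, and the list of misses points at which pair of tools is still being confused.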
Handling timeouts: retry with backoff inside a node
import asyncio

from langchain_core.messages import ToolMessage
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    reraise=True,  # surface the original exception instead of tenacity's RetryError
)
async def resilient_api_call(tool_name: str, args: dict) -> str:
    # asyncio.timeout requires Python 3.11+
    async with asyncio.timeout(10.0):  # 10 second hard limit per attempt
        return await external_api.call(tool_name, args)

async def tool_node_with_retry(state: AgentState) -> AgentState:
    tool_call = state["messages"][-1].tool_calls[0]
    try:
        result = await resilient_api_call(tool_call["name"], tool_call["args"])
        return {
            "messages": [ToolMessage(content=result, tool_call_id=tool_call["id"])],
            "last_tool_error": None,
        }
    except Exception as e:
        # After 3 attempts, report the error to the LLM
        error_content = f"Tool call failed after retries: {type(e).__name__}: {str(e)}"
        return {
            "messages": [ToolMessage(content=error_content, tool_call_id=tool_call["id"])],
            "last_tool_error": str(e),
        }

The key is reporting the error back to the LLM as a ToolMessage rather than raising an exception. The LLM can then decide how to recover: retry with different args, use a different tool, or tell the user honestly what went wrong.
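For readers unfamiliar with what the retry decorator is doing, the same idea can be sketched without tenacity. This is a simplified illustration, not tenacity's implementation: call_with_backoff, flaky_call, and the delay values are hypothetical, and the delays are kept tiny so the example finishes quickly.

```python
import asyncio


async def call_with_backoff(fn, attempts=3, base_delay=0.01, max_delay=0.1, timeout=1.0):
    """Await fn() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            # wait_for cancels the call if it exceeds the per-attempt timeout.
            return await asyncio.wait_for(fn(), timeout=timeout)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            await asyncio.sleep(min(base_delay * 2 ** attempt, max_delay))


calls = {"count": 0}


async def flaky_call() -> str:
    """Hypothetical API call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(call_with_backoff(flaky_call))
print(result, calls["count"])
```

In a real agent you would probably retry only on errors that are plausibly transient (timeouts, connection resets) and let validation or permission errors fail immediately.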
Handling schema drift: versioned tool schemas
Treat tool schemas like API contracts. Version them.
TOOL_SCHEMA_VERSION = "2.1.0"

SYSTEM_PROMPT = f"""You are an assistant with access to the following tools (schema version {TOOL_SCHEMA_VERSION}).

Tools:
- get_user(user_id: str) -> dict
  Returns: {{id, name, email, created_at}}
- search_orders(customer_id: str, start_date: str, end_date: str) -> list[dict]
  Returns: list of {{order_id, amount, status, date}}
  Date format: YYYY-MM-DD

IMPORTANT: Do not invent argument values. If you do not have a required argument, ask the user for it.
"""

When you update a tool, bump the version in the prompt. This forces you to audit whether the model's understanding of the schema is still accurate.
Observability: trace every tool call
You cannot fix what you cannot see. Log every tool call — what the LLM requested, what arguments it generated, whether validation passed, what the result was, and how long it took.
import time
from dataclasses import dataclass, asdict

from langchain_core.messages import ToolMessage

@dataclass
class ToolCallTrace:
    tool_name: str
    args: dict
    validation_passed: bool
    result_summary: str
    duration_ms: float
    error: str | None

# validate_args, execute_tool, and send_to_logging_pipeline are assumed
# to be defined elsewhere in the agent.
def traced_tool_node(state: AgentState) -> AgentState:
    tool_call = state["messages"][-1].tool_calls[0]
    start = time.monotonic()
    trace = ToolCallTrace(
        tool_name=tool_call["name"],
        args=tool_call["args"],
        validation_passed=False,
        result_summary="",
        duration_ms=0,
        error=None,
    )
    try:
        validated_args = validate_args(tool_call["name"], tool_call["args"])
        trace.validation_passed = True
        result = execute_tool(tool_call["name"], validated_args)
        trace.result_summary = str(result)[:200]  # truncate large results
        msg = ToolMessage(content=str(result), tool_call_id=tool_call["id"])
    except Exception as e:
        trace.error = str(e)
        msg = ToolMessage(content=f"Error: {e}", tool_call_id=tool_call["id"])
    finally:
        # Runs on success and failure, so every call is logged with its duration
        trace.duration_ms = (time.monotonic() - start) * 1000
        send_to_logging_pipeline(asdict(trace))
    return {"messages": [msg]}

With LangSmith, this tracing happens automatically:
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "production-agent"

Every graph invocation, every tool call, every LLM response is recorded. You can filter by tool name, error type, or duration to find patterns in your failures.
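If traces go to your own pipeline rather than LangSmith, the same kind of filtering can be done offline. The sketch below assumes trace records shaped like the ToolCallTrace dicts above; summarize_traces is a hypothetical helper and the sample records are made up purely for illustration.

```python
from collections import defaultdict

# Made-up trace records in the shape produced by asdict(ToolCallTrace).
TRACES = [
    {"tool_name": "get_user", "duration_ms": 120.0, "error": None},
    {"tool_name": "get_user", "duration_ms": 95.0, "error": None},
    {"tool_name": "search_orders", "duration_ms": 10050.0, "error": "TimeoutError"},
    {"tool_name": "search_orders", "duration_ms": 340.0, "error": None},
]


def summarize_traces(traces):
    """Group traces by tool and report call count, error rate, and slowest call."""
    grouped = defaultdict(list)
    for trace in traces:
        grouped[trace["tool_name"]].append(trace)
    summary = {}
    for tool_name, items in grouped.items():
        errors = sum(1 for t in items if t["error"] is not None)
        summary[tool_name] = {
            "calls": len(items),
            "error_rate": errors / len(items),
            "max_duration_ms": max(t["duration_ms"] for t in items),
        }
    return summary


summary = summarize_traces(TRACES)
for tool_name, stats in summary.items():
    print(tool_name, stats)
```

A per-tool view like this makes it obvious which tool is responsible for most errors or slow calls, which is usually the first question when reviewing traces.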
The reliability checklist
Before shipping a tool-calling agent:
- Every tool has a description that includes what it is NOT for
- Every tool's args are validated before execution
- Every tool call is wrapped in retry logic with a timeout
- Errors are returned as ToolMessage content, not raised exceptions
- Schema changes are versioned in the system prompt
- Tool call traces are sent to a logging pipeline
- You have run the agent against 50+ real user queries and reviewed the traces
The last item is the one most teams skip. Running against real queries — not synthetic test cases — is where you discover the failure modes that were not obvious from the schema.