Tool-Use Reliability: Four Failure Modes and How to Handle Them

Function calling works beautifully in demos. The LLM sees the tool schemas, picks the right one, generates valid arguments, and the result comes back clean.

Production is different. Users ask questions the demo did not cover. The LLM picks the wrong tool. It generates arguments that fail validation. The external API times out. The tool interface you refactored last quarter no longer matches the description still sitting in the system prompt.

These are not edge cases. They are the default state of a tool-calling agent under real load.

The four failure modes

1. Hallucinated arguments

The LLM calls the right tool but invents argument values that do not exist.

# Tool schema
def get_user(user_id: str) -> dict:
    """Fetch user details by ID."""
    ...
 
# LLM generates this call (no user with this ID exists)
{
  "name": "get_user",
  "args": {"user_id": "user_abc_123"}
}

The tool call succeeds (no exception) but returns an empty result. The LLM then confidently tells the user "I couldn't find any information about that user", and nothing in that answer reveals that the ID was invented rather than looked up.
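One way to make this failure visible is for the tool to return an explicit not-found status instead of an empty result, so the model can tell "this ID does not exist" apart from "this user has no data". The sketch below is illustrative only: the `USERS` store and the `get_user_checked` wrapper are hypothetical names, not part of the code above.

```python
from typing import Any

# Hypothetical in-memory store standing in for the real user database.
USERS: dict[str, dict[str, Any]] = {
    "usr_0000000042": {"id": "usr_0000000042", "name": "Ada"},
}

def get_user_checked(user_id: str) -> dict[str, Any]:
    """Return the user, or an explicit not-found payload the LLM can act on."""
    user = USERS.get(user_id)
    if user is None:
        # Say plainly that the ID is unknown, and tell the model what to do next.
        return {
            "status": "not_found",
            "user_id": user_id,
            "hint": "No user has this ID. Ask the user to confirm it; do not guess.",
        }
    return {"status": "ok", "user": user}
```

With a payload like this, the model has something concrete to relay or act on, rather than an empty dict it has to interpret.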

2. Wrong tool selection

The LLM picks a tool that is semantically adjacent but functionally wrong.

A user asks: "What orders did customer 42 place this month?"

Available tools: get_customer_profile, search_orders, list_all_orders

The LLM picks get_customer_profile. It gets customer metadata but no orders. The answer is wrong, and the LLM does not know it is wrong.

3. Timeout and network failure

External APIs fail. Your agent needs a plan for when they do.

# What happens without handling
tool_result = external_api.call(args)  # hangs for 30 seconds, then raises
 
# The LLM never gets a result
# The graph node raises an unhandled exception
# The user sees a 500

4. Schema drift

You update a tool's interface. The model's system prompt still describes the old interface. The model generates calls that no longer work.

This is the silent killer. It happens gradually, usually after a refactor, and surfaces as intermittent failures that are hard to reproduce.

Handling hallucinated arguments: validate before executing

Add a validation layer between the LLM's tool call decision and the actual tool execution.

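from langchain_core.messages import ToolMessage
from pydantic import BaseModel, ValidationError, field_validator

class GetUserArgs(BaseModel):
    user_id: str

    @field_validator("user_id")
    @classmethod
    def user_id_must_be_valid_format(cls, v: str) -> str:
        if not v.startswith("usr_") or len(v) < 10:
            raise ValueError(f"Invalid user_id format: {v}")
        return v

# Maps each tool name to the Pydantic model that validates its arguments
TOOL_VALIDATORS = {"get_user": GetUserArgs}

# AgentState and execute_tool are assumed to be defined elsewhere in the agent
def validated_tool_node(state: AgentState) -> AgentState:
    # Handles the first tool call only; loop over tool_calls if the model emits several
    tool_call = state["messages"][-1].tool_calls[0]
    tool_name = tool_call["name"]
    args = tool_call["args"]

    # Validate args before calling the tool
    validator = TOOL_VALIDATORS.get(tool_name)
    if validator:
        try:
            # Use the validated (and possibly coerced) values from here on
            args = validator(**args).model_dump()
        except ValidationError as e:
            # Return the failure to the LLM so it can correct the call
            error_msg = ToolMessage(
                content=f"Argument validation failed: {e}",
                tool_call_id=tool_call["id"],
            )
            return {"messages": [error_msg], "last_tool_error": str(e)}

    # Only execute if validation passed
    result = execute_tool(tool_name, args)
    return {
        "messages": [ToolMessage(content=str(result), tool_call_id=tool_call["id"])],
        "last_tool_error": None,
    }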

Handling wrong tool selection: tool descriptions matter more than code

The single highest-leverage fix for wrong tool selection is better tool descriptions. The LLM reads these at inference time.

# Bad: too vague
@tool
def search_orders(query: str) -> list:
    """Search orders."""
    ...
 
# Good: tells the LLM exactly when to use this
@tool
def search_orders(
    customer_id: str,
    start_date: str,
    end_date: str,
) -> list:
    """
    Search orders for a specific customer within a date range.
 
    Use this tool when the user asks about orders, purchases, or transactions
    for a named customer or customer ID. Requires a date range.
 
    Do NOT use this for browsing all orders — use list_all_orders instead.
    Do NOT use this for customer profile information — use get_customer_profile instead.
    """
    ...

The Do NOT use this for... pattern explicitly rules out the adjacent tools. It is tedious to write, but it targets the exact confusion behind most wrong selections: tools that sound alike and do different things.
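Whether a description change actually helps can be measured rather than assumed. The sketch below tallies which tool gets picked for a small set of labeled queries. The cases, the `selection_report` helper, and the `always_profile` stub are all hypothetical; in a real harness the selector would call the model with your tool schemas bound.

```python
from collections import Counter
from typing import Callable

# Hypothetical labeled cases: (user query, tool we expect the agent to pick).
LABELED_CASES = [
    ("What orders did customer 42 place this month?", "search_orders"),
    ("Show me every order in the system", "list_all_orders"),
    ("What is customer 42's email address?", "get_customer_profile"),
]

def selection_report(select_tool: Callable[[str], str], cases=LABELED_CASES) -> dict:
    """Run each query through select_tool and tally (expected, picked) confusions."""
    confusions: Counter = Counter()
    correct = 0
    for query, expected in cases:
        picked = select_tool(query)
        if picked == expected:
            correct += 1
        else:
            confusions[(expected, picked)] += 1
    return {"accuracy": correct / len(cases), "confusions": dict(confusions)}

def always_profile(query: str) -> str:
    """Stub selector for demonstration; a real one would ask the model."""
    return "get_customer_profile"

report = selection_report(always_profile)
```

The confusion pairs are the useful part: each (expected, picked) pair points at the two descriptions that need a sharper boundary between them.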

Handling timeouts: retry with backoff inside a node

import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential
 
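@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    reraise=True,  # surface the last real exception instead of tenacity's RetryError
)
async def resilient_api_call(tool_name: str, args: dict) -> str:
    # asyncio.timeout requires Python 3.11+; each attempt gets its own 10 second limit
    async with asyncio.timeout(10.0):
        return await external_api.call(tool_name, args)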
 
async def tool_node_with_retry(state: AgentState) -> AgentState:
    tool_call = state["messages"][-1].tool_calls[0]
 
    try:
        result = await resilient_api_call(tool_call["name"], tool_call["args"])
        return {
            "messages": [ToolMessage(content=result, tool_call_id=tool_call["id"])],
            "last_tool_error": None,
        }
    except Exception as e:
        # All 3 attempts failed: report the error to the LLM
        error_content = f"Tool call failed after retries: {type(e).__name__}: {str(e)}"
        return {
            "messages": [ToolMessage(content=error_content, tool_call_id=tool_call["id"])],
            "last_tool_error": str(e),
        }

The key is reporting the error back to the LLM as a ToolMessage rather than raising an exception. The LLM can then decide how to recover — retry with different args, use a different tool, or tell the user honestly what went wrong.
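Handing errors back to the model opens the door to a loop in which it keeps retrying a tool that keeps failing, so it can help to cap consecutive failures. The sketch below shows one possible routing function for that. The `consecutive_tool_errors` field, the `MAX_TOOL_ERRORS` budget, and the node names "agent" and "give_up" are assumptions made for illustration, not part of the state used above.

```python
from typing import Optional, TypedDict

MAX_TOOL_ERRORS = 2  # assumed budget; tune for your agent

class ErrorState(TypedDict):
    last_tool_error: Optional[str]
    consecutive_tool_errors: int

def record_tool_outcome(state: ErrorState, error: Optional[str]) -> ErrorState:
    """Reset the counter on success, increment it on failure."""
    if error is None:
        count = 0
    else:
        count = state["consecutive_tool_errors"] + 1
    return {"last_tool_error": error, "consecutive_tool_errors": count}

def route_after_tool(state: ErrorState) -> str:
    """Name the next node: back to the LLM, or a fallback that stops retrying."""
    if state["consecutive_tool_errors"] >= MAX_TOOL_ERRORS:
        return "give_up"
    return "agent"
```

A function shaped like `route_after_tool` could serve as the condition on a conditional edge after the tool node, with the "give_up" branch producing an honest failure message for the user.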

Handling schema drift: versioned tool schemas

Treat tool schemas like API contracts. Version them.

TOOL_SCHEMA_VERSION = "2.1.0"
 
SYSTEM_PROMPT = f"""You are an assistant with access to the following tools (schema version {TOOL_SCHEMA_VERSION}).
 
Tools:
- get_user(user_id: str) -> dict
  Returns: {{id, name, email, created_at}}
 
- search_orders(customer_id: str, start_date: str, end_date: str) -> list[dict]
  Returns: list of {{order_id, amount, status, date}}
  Date format: YYYY-MM-DD
 
IMPORTANT: Do not invent argument values. If you do not have a required argument, ask the user for it.
"""

When you update a tool, bump the version in the prompt. This forces you to audit whether the model's understanding of the schema is still accurate.

Observability: trace every tool call

You cannot fix what you cannot see. Log every tool call — what the LLM requested, what arguments it generated, whether validation passed, what the result was, and how long it took.

import time
from dataclasses import dataclass, asdict
 
@dataclass
class ToolCallTrace:
    tool_name: str
    args: dict
    validation_passed: bool
    result_summary: str
    duration_ms: float
    error: str | None
 
def traced_tool_node(state: AgentState) -> AgentState:
    tool_call = state["messages"][-1].tool_calls[0]
    start = time.monotonic()
 
    trace = ToolCallTrace(
        tool_name=tool_call["name"],
        args=tool_call["args"],
        validation_passed=False,
        result_summary="",
        duration_ms=0,
        error=None,
    )
 
    try:
        validated_args = validate_args(tool_call["name"], tool_call["args"])
        trace.validation_passed = True
        result = execute_tool(tool_call["name"], validated_args)
        trace.result_summary = str(result)[:200]
        msg = ToolMessage(content=str(result), tool_call_id=tool_call["id"])
    except Exception as e:
        trace.error = str(e)
        msg = ToolMessage(content=f"Error: {e}", tool_call_id=tool_call["id"])
    finally:
        trace.duration_ms = (time.monotonic() - start) * 1000
        send_to_logging_pipeline(asdict(trace))
 
    return {"messages": [msg]}

With LangSmith, tracing of graph runs, LLM calls, and tool calls is switched on through environment variables:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "production-agent"

Every graph invocation, every tool call, every LLM response is recorded. You can filter by tool name, error type, or duration to find patterns in your failures.
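If the traces go to your own pipeline instead, the same pattern-finding can be done offline. The sketch below groups trace dicts (shaped like the `asdict(trace)` output above) by tool and reports call count, error rate, and worst-case duration. `summarize_traces` is a hypothetical helper, and the sample rows are made-up data for illustration.

```python
from collections import defaultdict

def summarize_traces(traces: list[dict]) -> dict[str, dict]:
    """Group trace dicts by tool_name; report calls, error rate, and max duration."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for t in traces:
        grouped[t["tool_name"]].append(t)

    summary: dict[str, dict] = {}
    for name, rows in grouped.items():
        errors = sum(1 for r in rows if r["error"] is not None)
        summary[name] = {
            "calls": len(rows),
            "error_rate": errors / len(rows),
            "max_duration_ms": max(r["duration_ms"] for r in rows),
        }
    return summary

# Made-up sample rows, only the fields the summary reads.
sample = [
    {"tool_name": "get_user", "error": None, "duration_ms": 40.0},
    {"tool_name": "get_user", "error": "TimeoutError", "duration_ms": 10000.0},
    {"tool_name": "search_orders", "error": None, "duration_ms": 120.0},
]
summary = summarize_traces(sample)
```

A tool whose error rate or worst-case duration stands out in a summary like this is the first place to look for one of the four failure modes.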

The reliability checklist

Before shipping a tool-calling agent:

Every tool has a description that includes what it is NOT for
Every tool's args are validated before execution
Every tool call is wrapped in retry logic with a timeout
Errors are returned as ToolMessage content, not raised exceptions
Schema changes are versioned in the system prompt
Tool call traces are sent to a logging pipeline
You have run the agent against 50+ real user queries and reviewed the traces

The last item is the one most teams skip. Running against real queries — not synthetic test cases — is where you discover the failure modes that were not obvious from the schema.