[BEE-30018] LLM Tool Use and Function Calling Patterns
INFO
Tool use transforms an LLM from a text generator into an agent that can act on the world. Understanding the protocol mechanics — how tool calls are encoded, how results flow back, how errors are handled, and how parallel execution works — is prerequisite to building reliable AI features on top of any provider's API.
Context
LLMs are stateless text predictors. They cannot query a database, call an API, check the current time, or take any action that requires I/O. The tool use protocol solves this by defining a structured turn-taking contract: the model signals a desired action as a structured call, the host executes it, and the result returns as another message. The model then reasons over the result and either calls more tools or produces a final answer.
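The turn-taking contract described above can be sketched as a four-message exchange. The shapes below are illustrative (modeled on Anthropic-style content blocks); no live API is called, and the tool name and ids are hypothetical:

```python
# Illustrative message shapes for one complete tool-use turn.
conversation = [
    # 1. The user asks a question that requires I/O to answer.
    {"role": "user", "content": "What's the weather in Paris?"},
    # 2. The model emits a structured tool call instead of free text.
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "call_1",
         "name": "get_weather", "input": {"city": "Paris"}},
    ]},
    # 3. The host executes the tool and returns the result as a message.
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "call_1",
         "content": "14°C, overcast"},
    ]},
    # 4. The model reasons over the result and produces the final answer.
    {"role": "assistant",
     "content": "It is currently 14°C and overcast in Paris."},
]
```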
This pattern, introduced as the ReAct (Reason + Act) framework by Yao et al. (arXiv:2210.03629, ICLR 2023), showed that interleaving reasoning traces with concrete actions outperformed chain-of-thought alone by grounding reasoning in real-world feedback rather than pure language continuation.
The protocol has since been standardized at the API level. OpenAI introduced function calling in 2023 and has iterated toward a strict mode that guarantees schema compliance through constrained decoding. Anthropic ships a parallel tool_use content block system with native support for streaming tool calls. The Berkeley Function Calling Leaderboard (BFCL) now benchmarks over a hundred models on two thousand real-world function signatures across serial, parallel, and multi-turn scenarios.
Design Thinking
Tool use introduces a new class of failure modes that pure text generation does not have:
Schema compliance failures: The model outputs a tool call with invalid parameters — wrong types, missing required fields, values outside enum sets. Constrained decoding eliminates this at the output layer; schema validation catches it before execution when constrained decoding is unavailable.
Tool selection failures: The model calls the wrong tool, fails to call when it should, or calls tools unnecessarily. These are prompt engineering and schema design problems — the quality of tool descriptions determines selection accuracy more than model size in most cases.
Execution failures: The tool runs but returns an error. The model must decide whether to retry, fall back to a different tool, or surface a degraded response. Error propagation through tool results is a first-class concern.
Security failures: Tool parameters derived from user-controlled content can redirect agent behavior. Parameter validation must happen at the execution layer, not in the prompt.
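When constrained decoding is unavailable, the pre-execution schema check mentioned under schema compliance failures can be sketched with the standard library alone. This is an illustration, not a full JSON Schema implementation; production code would typically use a library such as jsonschema:

```python
# Stdlib-only sketch: check required fields, basic types, and enums
# on model-produced arguments before executing the tool.
TYPES = {"string": str, "integer": int, "number": float}

def validate_args(schema: dict, args: dict) -> list:
    """Return a list of violations; an empty list means the call is valid."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        expected = TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: value not in {spec['enum']}")
    return errors
```

A failed check can be returned to the model as an error tool result, giving it a chance to correct the call.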
Best Practices
Structure Tool Schemas for Precision
MUST define description at both the tool level and for each parameter. The model reads descriptions at inference time to decide which tool to call and how to populate parameters:
# OpenAI / Anthropic shared schema structure
search_tool = {
    "name": "search_documents",
    "description": (
        "Search the knowledge base for documents relevant to a query. "
        "Use this when the user asks a factual question that may be answered "
        "by internal documentation. Do NOT use for real-time data."
    ),
    "input_schema": {  # Anthropic; OpenAI uses "parameters"
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query in natural language, 5-100 words."
            },
            "max_results": {
                "type": "integer",
                "description": "Number of results to return. Default 5, max 20.",
                "default": 5
            },
            "date_filter": {
                "type": "string",
                "enum": ["last_7_days", "last_30_days", "last_year", "all_time"],
                "description": "Restrict results to documents published within this window."
            }
        },
        "required": ["query"]
    }
}
MUST use enums for parameters with a fixed value set. The enum on date_filter above prevents the model from hallucinating values like "last_week" or "recent".
SHOULD keep schemas flat. Deeply nested objects create more parsing surface area and inflate argument token counts. If you need nested data, accept a JSON string and parse it in the tool implementation rather than encoding the nesting in the schema.
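A sketch of this pattern, using a hypothetical query_orders tool whose filters_json parameter carries the nested data as a string (all names here are illustrative):

```python
import json

# Hypothetical flat schema: nested filter data travels as a JSON string
# ("filters_json") and is parsed inside the tool implementation.
flat_tool = {
    "name": "query_orders",
    "description": "Query a customer's orders, optionally filtered.",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "Customer to query."},
            "filters_json": {
                "type": "string",
                "description": 'Optional JSON object of filters, e.g. {"status": "shipped"}.',
            },
        },
        "required": ["customer_id"],
    },
}

def query_orders(customer_id: str, filters_json: str = "{}") -> dict:
    # Parsing happens here, where a bad value can be reported back as a
    # tool_result error instead of surfacing as a schema failure.
    filters = json.loads(filters_json)
    return {"customer_id": customer_id, "filters": filters}
```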
SHOULD enable strict mode (OpenAI) when available. Strict mode uses constrained decoding to guarantee that every generated argument conforms to the schema — no additional properties, correct types, required fields present:
client.chat.completions.create(
    model="gpt-4o",
    tools=[{
        "type": "function",
        "function": {
            **tool_definition,
            "strict": True,  # Guarantees schema compliance via constrained decoding
        }
    }],
    messages=messages,
)
Execute Tool Calls in Parallel Where Possible
SHOULD execute independent tool calls in parallel when the model returns multiple tool calls in a single response. Sequential execution for independent calls multiplies latency unnecessarily:
import asyncio

import anthropic

client = anthropic.AsyncAnthropic()

async def run_tool(tool_use_block, tool_registry):
    fn = tool_registry[tool_use_block.name]
    try:
        result = await fn(**tool_use_block.input)
        return {"type": "tool_result", "tool_use_id": tool_use_block.id, "content": str(result)}
    except Exception as e:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_block.id,
            "content": f"Error: {e}",
            "is_error": True,
        }

async def run_agent_turn(messages, tools, tool_registry):
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        return response  # Final answer
    tool_use_blocks = [b for b in response.content if b.type == "tool_use"]
    # Execute all tool calls concurrently
    tool_results = await asyncio.gather(
        *[run_tool(b, tool_registry) for b in tool_use_blocks]
    )
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": list(tool_results)})
    return await run_agent_turn(messages, tools, tool_registry)  # Next turn
SHOULD check for dependency ordering before parallelizing. If tool B requires the output of tool A, execute the calls sequentially. Build a simple dependency graph when the agent plan is known ahead of time.
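One minimal sketch of such dependency-aware execution, assuming the plan supplies a deps map from each call id to the ids it depends on (run_in_dependency_order and its parameters are illustrative names, not part of any provider SDK):

```python
import asyncio

# Hypothetical sketch: group planned tool calls into batches, where a
# call becomes ready once every call it depends on has finished, then
# run each ready batch concurrently.
async def run_in_dependency_order(calls: dict, deps: dict, execute):
    done: set = set()
    results: dict = {}
    remaining = set(calls)
    while remaining:
        # A call is ready when all of its dependencies are done.
        batch = [name for name in remaining if set(deps.get(name, ())) <= done]
        if not batch:
            raise ValueError("dependency cycle among tool calls")
        outputs = await asyncio.gather(*(execute(calls[name]) for name in batch))
        results.update(zip(batch, outputs))
        done.update(batch)
        remaining.difference_update(batch)
    return results
```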
MUST NOT fire concurrent calls to rate-limited external APIs without a concurrency limiter. Four simultaneous calls to the same third-party API may hit per-second rate limits and all fail together. Use asyncio.Semaphore to cap concurrency:
semaphore = asyncio.Semaphore(2)  # Max 2 concurrent external calls

async def rate_limited_tool(tool_block, tool_registry):
    async with semaphore:
        return await run_tool(tool_block, tool_registry)
Classify and Handle Tool Errors Explicitly
MUST return structured error information in tool results rather than raising exceptions that abort the agent loop. The model can reason over an error and choose an alternative:
async def safe_tool_executor(tool_block, tool_registry):
    if tool_block.name not in tool_registry:
        return {
            "type": "tool_result",
            "tool_use_id": tool_block.id,
            "content": f"Unknown tool: {tool_block.name}. Available tools: {list(tool_registry)}",
            "is_error": True,
        }
    fn = tool_registry[tool_block.name]
    try:
        result = await asyncio.wait_for(fn(**tool_block.input), timeout=30.0)
        return {"type": "tool_result", "tool_use_id": tool_block.id, "content": str(result)}
    except asyncio.TimeoutError:
        return {
            "type": "tool_result",
            "tool_use_id": tool_block.id,
            "content": "Tool timed out after 30 seconds.",
            "is_error": True,
        }
    except ValueError as e:
        # Client error: do not retry
        return {
            "type": "tool_result",
            "tool_use_id": tool_block.id,
            "content": f"Invalid argument: {e}",
            "is_error": True,
        }
    except Exception as e:
        # Transient error: model may retry
        return {
            "type": "tool_result",
            "tool_use_id": tool_block.id,
            "content": f"Tool failed: {e}. You may retry.",
            "is_error": True,
        }
SHOULD distinguish client errors (bad arguments — do not retry) from transient errors (network, timeout — may retry) in the error message text. The model uses this signal to decide its next action.
SHOULD set a maximum turn limit to prevent infinite retry loops:
MAX_TURNS = 10

async def agent_loop(messages, tools, tool_registry):
    for _ in range(MAX_TURNS):
        response = await run_agent_turn(messages, tools, tool_registry)
        if response.stop_reason != "tool_use":
            return response
    raise RuntimeError("Agent exceeded maximum turn limit")
Validate Tool Parameters Before Execution
MUST validate all parameters before passing them to tool implementations. Tool input arrives as a dict from the model and must be treated as untrusted data at the execution boundary:
from pydantic import BaseModel, field_validator

class SearchInput(BaseModel):
    query: str
    max_results: int = 5
    date_filter: str = "all_time"

    @field_validator("query")
    @classmethod
    def query_length(cls, v):
        if len(v) < 3:
            raise ValueError("query must be at least 3 characters")
        if len(v) > 500:
            raise ValueError("query must be at most 500 characters")
        return v

    @field_validator("max_results")
    @classmethod
    def results_range(cls, v):
        if not 1 <= v <= 20:
            raise ValueError("max_results must be between 1 and 20")
        return v

    @field_validator("date_filter")
    @classmethod
    def valid_filter(cls, v):
        allowed = {"last_7_days", "last_30_days", "last_year", "all_time"}
        if v not in allowed:
            raise ValueError(f"date_filter must be one of {allowed}")
        return v

async def search_documents(tool_input: dict) -> str:
    params = SearchInput(**tool_input)  # Raises ValidationError on bad input
    return await _execute_search(params)
MUST NOT pass tool call arguments directly to shell commands, SQL queries, or file paths without sanitization. Tool parameter injection is the agent equivalent of SQL injection — a malicious value in a user-controlled field can redirect agent behavior across connected systems.
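As one illustration of execution-layer sanitization, a model-supplied file path can be resolved and contained inside an allowed root before any file access. DOCS_ROOT and resolve_safe_path are hypothetical names used only for this sketch:

```python
from pathlib import Path

# Hypothetical execution-layer check: contain a model-supplied path
# inside an allowed root before touching the filesystem.
DOCS_ROOT = Path("/srv/agent-docs")

def resolve_safe_path(user_path: str) -> Path:
    candidate = (DOCS_ROOT / user_path).resolve()
    # A traversal value like "../../etc/passwd", or an absolute path,
    # resolves outside the root and is rejected.
    if not candidate.is_relative_to(DOCS_ROOT):
        raise ValueError(f"path escapes allowed root: {user_path}")
    return candidate
```

The same containment principle applies to SQL (parameterized queries) and shell commands (argument lists, never string interpolation).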
Ensure Tool Idempotency for State-Changing Operations
MUST make state-changing tools idempotent. The agent loop may retry a tool call after a transient failure; a non-idempotent tool (such as "send email" or "charge payment") can execute the action twice if retried:
import hashlib

async def send_email(recipient: str, subject: str, body: str, idempotency_key: str | None = None) -> str:
    """Send an email. idempotency_key prevents duplicate sends on retry."""
    key = idempotency_key or hashlib.sha256(
        f"{recipient}:{subject}:{body}".encode()
    ).hexdigest()
    if await email_log.exists(key):
        return f"Email already sent (idempotency_key={key})"
    await _send(recipient, subject, body)
    await email_log.record(key)
    return f"Email sent to {recipient}"
SHOULD generate idempotency keys from the content of the action (recipient + subject + body), not from a random UUID generated at call time. Content-addressed keys survive agent restarts and allow the model to safely retry after a crash.
Audit Every Tool Call
MUST log every tool invocation with enough context to reconstruct what happened and who triggered it:
import time

async def audited_tool(tool_block, tool_registry, user_id: str, agent_run_id: str):
    start = time.monotonic()
    result = await safe_tool_executor(tool_block, tool_registry)
    elapsed_ms = (time.monotonic() - start) * 1000
    await audit_log.record(
        timestamp=time.time(),
        user_id=user_id,
        agent_run_id=agent_run_id,
        tool_name=tool_block.name,
        tool_input=tool_block.input,  # Log the actual args, not just the name
        is_error=result.get("is_error", False),
        duration_ms=elapsed_ms,
    )
    return result
Audit logs serve two purposes: security forensics (which tool was called with which arguments by which user) and cost attribution (which agent run consumed which external API quota).
Related BEEs
- BEE-30002 -- AI Agent Architecture Patterns: tool use is the mechanism through which agents act; this BEE covers protocol mechanics while BEE-30002 covers orchestration, planning, and multi-agent topologies
- BEE-30006 -- Structured Output and Constrained Decoding: strict mode and guided decoding for tool arguments use the same constrained decoding infrastructure as structured JSON output
- BEE-30008 -- LLM Security and Prompt Injection: tool parameter injection is a specific attack vector enabled by tool use; indirect prompt injection via tool results is a first-class threat
- BEE-30016 -- LLM Streaming Patterns: tool call arguments stream incrementally and must be accumulated before parsing; stop_reason signals are the same in streaming and batch modes
- BEE-30017 -- AI Memory Systems for Long-Running Agents: memory retrieval and storage are typically implemented as tools in a tool-use framework
References
- Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models — arXiv:2210.03629, ICLR 2023
- Shishir G. Patil et al. Gorilla: Large Language Model Connected with Massive APIs — arXiv:2305.15334, NeurIPS 2024
- Berkeley Function Calling Leaderboard — gorilla.cs.berkeley.edu
- OpenAI. Function Calling — platform.openai.com
- Anthropic. Tool Use Overview — docs.anthropic.com
- Anthropic. Building Effective Agents — anthropic.com
- Lilian Weng. LLM Powered Autonomous Agents — lilianweng.github.io
- Martin Fowler. Function Calling Using LLMs — martinfowler.com
- OWASP. Top 10 for Agentic Applications 2026 — genai.owasp.org