If you've built anything with an LLM in the last year, you've hit the wall: the model can generate beautiful text, but it can't do anything. It can't check a database. It can't call your API. It can't look up the weather, send an email, or query your company's internal knowledge base. It's a brilliant conversationalist trapped in a room with no doors.
Function calling — also called tool use — is the door. It's the mechanism that lets an LLM output structured data to invoke external functions instead of just generating text. And in 2026, it's the single most important skill for any engineer building AI-powered applications.
This guide covers how function calling works under the hood, code examples for OpenAI and Anthropic Claude, a feature comparison that includes Google Gemini, the Model Context Protocol (MCP) that is standardizing tool access, and the production patterns that separate toy demos from real systems.
How Function Calling Works
The core idea is elegant. Instead of just returning text, the model can return a structured request to call a specific function with specific arguments. Your code executes the function, returns the result, and the model uses that result to generate its final response.
The critical insight is that the model never executes the function itself. It only generates the intent — a JSON object with the function name and arguments. Your application code is responsible for the actual execution. This is a safety feature: the model proposes actions, and your code validates and executes them.
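To make that division of labor concrete, here is a minimal provider-agnostic sketch. The JSON string stands in for what a model might emit; `TOOL_REGISTRY`, `dispatch`, and the stubbed `get_weather` are illustrative names chosen for this example, not part of any SDK.

```python
import json

def get_weather(city: str) -> dict:
    """Stub standing in for a real weather lookup (hypothetical data)."""
    return {"city": city, "temp_c": 21, "condition": "clear"}

# Only functions listed here can ever run, whatever the model asks for.
TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(model_output: str) -> dict:
    """Parse the model's proposed call, check it, then run it in our own code."""
    call = json.loads(model_output)
    func = TOOL_REGISTRY.get(call.get("name"))
    if func is None:
        return {"error": f"unknown tool: {call.get('name')}"}
    return func(**call.get("arguments", {}))

# What a model might emit: an intent, not an execution.
proposed = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
result = dispatch(proposed)
print(result)
```

Because the registry is an allowlist, a request for a tool you never registered comes back as an error value rather than being executed.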
What you define for each tool
Every tool definition includes three things:
- Name. A clear, descriptive identifier like get_weather or search_knowledge_base. The model uses this name to decide which tool to call.
- Description. A natural-language explanation of what the tool does and when to use it. This is critical: the model reads this description to determine whether the tool is relevant to the user's request.
- Parameters. A JSON Schema definition of the function's input parameters: types, required fields, enums, descriptions for each field. Better schemas lead to more accurate function calls.
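As an illustration of all three parts together, the sketch below defines a hypothetical search_knowledge_base tool. The field names inside `properties`, the category values, and the limits are invented for the example; the top-level keys follow the neutral name/description/parameters shape described above rather than any one provider's wrapper.

```python
import json

search_knowledge_base = {
    "name": "search_knowledge_base",
    "description": (
        "Search the internal knowledge base for articles matching a query. "
        "Use this when the user asks about company policies or product docs. "
        "Returns up to `limit` article titles with short snippets."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search terms"},
            "category": {
                "type": "string",
                # An enum narrows what the model can produce for this field.
                "enum": ["policy", "product", "engineering"],
                "description": "Restrict results to one category",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 20,
                "description": "Maximum number of results",
            },
        },
        "required": ["query"],
    },
}

print(json.dumps(search_knowledge_base, indent=2))
```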
Function Calling with OpenAI
OpenAI popularized function calling in June 2023, when it shipped the feature for GPT-3.5 Turbo and GPT-4, and has since evolved the API significantly. As of 2026, GPT-4.1 achieves 97-99% accuracy on function calling benchmarks and supports up to 128 tools per request.
```python
from openai import OpenAI

client = OpenAI()

# Define the tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name"
                }
            },
            "required": ["city"]
        }
    }
}]

# Send message with tools
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

# The model returns a tool_call, not text
tool_call = response.choices[0].message.tool_calls[0]
# tool_call.function.name == "get_weather"
# tool_call.function.arguments == '{"city": "Tokyo"}'
```
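The snippet above stops at the model's request. To finish the round trip in the Chat Completions message format, the assistant turn and one `tool` message per call are appended to the history and sent back. The sketch below builds that history with plain dicts so it runs without a network call; `append_tool_results` and the `call_123` id are made up for illustration.

```python
import json

def append_tool_results(messages, assistant_message, results):
    """Return a new history with the assistant turn and one tool message per call.

    assistant_message: the assistant turn as a dict, including its "tool_calls".
    results: mapping of tool_call id -> Python value returned by your function.
    """
    updated = list(messages) + [assistant_message]
    for call in assistant_message["tool_calls"]:
        updated.append({
            "role": "tool",
            "tool_call_id": call["id"],          # ties the result to the request
            "content": json.dumps(results[call["id"]]),
        })
    return updated

history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_123",  # made-up id for illustration
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'},
    }],
}
next_messages = append_tool_results(
    history, assistant_turn, {"call_123": {"temp_c": 21, "condition": "clear"}}
)
# next_messages would then be passed as `messages` in a second
# chat.completions.create(...) call, with the same `tools` list.
```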
Function Calling with Anthropic Claude
Anthropic calls it "tool use" and supports parallel tool calls natively, with a maximum of 64 tools per request. Claude Opus 4 and Sonnet 4 are both highly reliable at structured tool calling, and Claude's extended thinking mode gives it an edge on complex multi-step tool chains where reasoning about which tools to call matters.
```python
import anthropic

client = anthropic.Anthropic()

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name"
            }
        },
        "required": ["city"]
    }
}]

# Send message with tools
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Claude returns tool_use content blocks
for block in response.content:
    if block.type == "tool_use":
        # block.name == "get_weather"
        # block.input == {"city": "Tokyo"}
        # execute_tool is your own dispatcher, not part of the SDK
        result = execute_tool(block.name, block.input)
```
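With Claude, results go back as `tool_result` content blocks inside a `user` turn, each keyed by the `id` of the `tool_use` block it answers. A minimal sketch of building that turn, using plain dicts in place of SDK objects; `tool_result_turn` and the `toolu_01` id are hypothetical.

```python
import json

def tool_result_turn(tool_uses, results):
    """Build the user turn that answers one or more tool_use blocks.

    tool_uses: list of dicts with at least an "id" key (one per tool_use block).
    results: mapping of tool_use id -> Python value returned by your function.
    """
    content = []
    for use in tool_uses:
        content.append({
            "type": "tool_result",
            "tool_use_id": use["id"],
            "content": json.dumps(results[use["id"]]),
        })
    return {"role": "user", "content": content}

uses = [{"id": "toolu_01", "name": "get_weather", "input": {"city": "Tokyo"}}]
turn = tool_result_turn(uses, {"toolu_01": {"temp_c": 21, "condition": "clear"}})
# The follow-up request would carry three messages: the original user message,
# an assistant message whose content is the previous response's content blocks,
# and `turn`.
```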
Provider Comparison
Each provider implements function calling differently. Here's what matters in practice:
| Feature | OpenAI (GPT-4.1) | Anthropic (Claude Opus 4) | Google (Gemini 2.5 Pro) |
|---|---|---|---|
| Max tools per request | 128 | 64 | 64 |
| Parallel tool calls | Yes (default) | Yes (native) | Yes |
| Forced tool use | tool_choice: "required" | tool_choice: {"type": "any"} | tool_config: {mode: "ANY"} |
| Streaming support | Yes | Yes | Yes |
| Accuracy (benchmarks) | 97-99% | 96-98% | 93-96% |
| Schema format | JSON Schema | JSON Schema | JSON Schema (OpenAPI subset) |
| Token overhead per tool | ~100-300 tokens | ~100-300 tokens | ~150-350 tokens |
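Since all three providers accept JSON Schema for parameters, the differences are mostly in the wrapper around it. The sketch below converts one neutral definition into each shape. The `NEUTRAL_TOOL` layout and the converter names are assumptions for this example, and the Gemini shape is the snake_case dict form as I understand it from Google's Python SDKs, so check it against current documentation before relying on it.

```python
NEUTRAL_TOOL = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

def to_openai(tool):
    """Chat Completions style: schema nested under function.parameters."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["schema"],
        },
    }

def to_anthropic(tool):
    """Messages API style: schema under input_schema."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["schema"],
    }

def to_gemini(tools):
    """Gemini style (assumed): one tool object wrapping function declarations."""
    return {
        "function_declarations": [
            {"name": t["name"], "description": t["description"], "parameters": t["schema"]}
            for t in tools
        ]
    }

openai_tool = to_openai(NEUTRAL_TOOL)
anthropic_tool = to_anthropic(NEUTRAL_TOOL)
gemini_tool = to_gemini([NEUTRAL_TOOL])
```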
The MCP Protocol: Universal Tool Access
The Model Context Protocol (MCP), created by Anthropic and open-sourced in November 2024, is rapidly becoming the standard for connecting AI models to external tools. Think of it as USB-C for AI: instead of writing custom function definitions for each provider, you define your tools once as an MCP server, and any MCP-compatible client can use them.
MCP solves three problems that direct function calling doesn't:
- Discovery. With function calling, you must tell the model about every available tool upfront. MCP supports tool discovery: a client can ask an MCP server at runtime which tools it offers, and can layer search over that list when the tool catalog is large.
- Portability. A tool defined as an MCP server works with Claude, GPT, Gemini, or any MCP-compatible client. No rewriting tool definitions per provider.
- Composability. An agent can connect to multiple MCP servers simultaneously — one for your database, one for GitHub, one for Slack — and the model can use tools from any of them in a single conversation.
```python
from mcp.server.fastmcp import FastMCP

app = FastMCP("weather-server")

# fetch_weather_api and fetch_forecast_api are placeholders for your own
# async data-access helpers; they are not part of the MCP SDK.

# Define a tool; any MCP client can call it
@app.tool()
async def get_weather(city: str) -> str:
    """Get current weather for a city."""
    weather = await fetch_weather_api(city)
    return f"{city}: {weather.temp}°F, {weather.condition}"

@app.tool()
async def get_forecast(city: str, days: int = 5) -> str:
    """Get weather forecast for the next N days."""
    forecast = await fetch_forecast_api(city, days)
    return forecast.to_json()

if __name__ == "__main__":
    app.run()  # serves over stdio by default
```
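The composability point can be illustrated without the SDK. The sketch below is a library-free simulation, not MCP itself: `FakeServer` mimics the two operations a client relies on (listing tools and calling one), and `build_catalog` and `route` show one possible way to merge several servers into a single namespaced catalog. All names here are invented for the example.

```python
class FakeServer:
    """Stand-in for an MCP server connection; holds tool name -> callable."""

    def __init__(self, name, tools):
        self.name = name
        self._tools = tools

    def list_tools(self):
        return sorted(self._tools)

    def call_tool(self, tool, arguments):
        return self._tools[tool](**arguments)

def build_catalog(servers):
    """Discover every server's tools and index them as 'server.tool'."""
    catalog = {}
    for server in servers:
        for tool in server.list_tools():
            catalog[f"{server.name}.{tool}"] = (server, tool)
    return catalog

def route(catalog, qualified_name, arguments):
    """Send a call to whichever server owns the tool."""
    server, tool = catalog[qualified_name]
    return server.call_tool(tool, arguments)

weather = FakeServer("weather", {"get_weather": lambda city: f"{city}: clear"})
tickets = FakeServer("tickets", {"open_ticket": lambda title: f"opened: {title}"})

catalog = build_catalog([weather, tickets])
print(sorted(catalog))
print(route(catalog, "weather.get_weather", {"city": "Tokyo"}))
```

Prefixing tool names with the server name is one way to avoid collisions when two servers expose a tool with the same name.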
Production Patterns
Building a demo with function calling takes an hour. Building a production system takes weeks of learning the hard way. Here are the patterns that matter:
1. Write better tool descriptions
The model chooses which tool to call based almost entirely on the description you write. A bad description — "Gets data" — leads to wrong tool selection. A good description explains what the tool does, when to use it, and what it returns.
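For a side-by-side, the sketch below contrasts a vague definition with one that states what the tool does, when to use it, and what it returns. The get_order_status tool is invented for the example, and `description_warnings` is a deliberately crude heuristic check (word count and two phrase matches), offered as a starting point rather than a real quality measure.

```python
VAGUE = {"name": "get_data", "description": "Gets data"}

SPECIFIC = {
    "name": "get_order_status",
    "description": (
        "Look up the current status of a customer order by its order ID. "
        "Use this when the user asks where an order is or whether it has shipped. "
        "Returns the status, carrier, and estimated delivery date."
    ),
}

def description_warnings(tool, min_words=12):
    """Flag descriptions that skip the what / when / returns structure."""
    text = tool["description"].lower()
    warnings = []
    if len(text.split()) < min_words:
        warnings.append("too short to cover what, when, and returns")
    if "use this when" not in text:
        warnings.append("does not say when to use the tool")
    if "return" not in text:
        warnings.append("does not say what the tool returns")
    return warnings

print(description_warnings(VAGUE))
print(description_warnings(SPECIFIC))
```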
"The single most impactful thing you can do to improve tool calling accuracy is to write better descriptions. Engineers spend hours optimizing their prompts and five seconds on tool descriptions. Flip that ratio."— Anthropic Engineering, "Writing effective tools for AI agents"
2. Validate tool arguments before execution
The model generates JSON arguments, but it can hallucinate field names, use wrong types, or produce invalid values. Always validate against your schema before executing. Libraries like Pydantic (Python) or Zod (TypeScript) make this trivial.
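A minimal sketch of that validation step, written with only the standard library so it runs anywhere; in practice a Pydantic model would replace most of it. `validate_arguments` and the small `SCHEMA` (with a hypothetical `days` field) are assumptions for the example, and only a few JSON Schema keywords are handled.

```python
import json

SCHEMA = {
    "properties": {
        "city": {"type": "string"},
        "days": {"type": "integer", "minimum": 1, "maximum": 14},
    },
    "required": ["city"],
}

_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_arguments(raw_json, schema):
    """Return (args, errors); args is None whenever errors is non-empty."""
    try:
        args = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return None, ["arguments must be a JSON object"]

    errors = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")  # hallucinated field name
            continue
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        bool_misuse = isinstance(value, bool) and spec["type"] != "boolean"
        if bool_misuse or not isinstance(value, _TYPES[spec["type"]]):
            errors.append(f"{name} should be {spec['type']}")
            continue
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name} must be >= {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name} must be <= {spec['maximum']}")
    return (None, errors) if errors else (args, [])

good = validate_arguments('{"city": "Tokyo", "days": 3}', SCHEMA)
bad = validate_arguments('{"town": "Tokyo", "days": "3"}', SCHEMA)
```

Returning the error list instead of raising makes it easy to send the problems back to the model as a tool result so it can correct its call.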
3. Handle parallel tool calls
Modern models frequently call multiple tools in parallel — "What's the weather in Tokyo and New York?" produces two simultaneous tool calls. Your code must handle this: execute them concurrently, collect results, and send them all back in a single response.
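One way to handle this in Python is `asyncio.gather`, which runs the calls concurrently and returns results in the same order as the inputs. In the sketch below the tool is a stub with a short sleep in place of network latency, and the `calls` list uses a simplified, provider-neutral shape with invented ids.

```python
import asyncio

async def get_weather(city):
    await asyncio.sleep(0.01)  # stands in for network latency
    return {"city": city, "condition": "clear"}

TOOLS = {"get_weather": get_weather}

async def run_tool_calls(tool_calls):
    """Execute all requested calls concurrently; one failure does not sink the rest."""
    async def run_one(call):
        try:
            output = await TOOLS[call["name"]](**call["arguments"])
        except Exception as exc:
            output = {"error": str(exc)}
        # Keep the id so each result can be matched to its request.
        return {"id": call["id"], "output": output}

    return await asyncio.gather(*(run_one(c) for c in tool_calls))

calls = [
    {"id": "call_1", "name": "get_weather", "arguments": {"city": "Tokyo"}},
    {"id": "call_2", "name": "get_weather", "arguments": {"city": "New York"}},
]
results = asyncio.run(run_tool_calls(calls))
```

Catching exceptions inside `run_one` means every call id gets a result, which matters because providers generally expect an answer for each tool call they issued.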
4. Limit tool count for cost and accuracy
Each tool definition adds ~100-300 tokens to every API request. With 20 tools, you're burning 2,000-6,000 tokens before the conversation even starts. More importantly, accuracy degrades as tool count increases. Production systems use tool filtering: only send the 5-10 tools most relevant to the current conversation context.
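A rough sketch of both ideas: estimating the fixed token overhead from the per-tool range quoted above, and filtering tools by relevance. The keyword-overlap scoring in `select_tools` is a simple stand-in chosen for this example; production systems often use embeddings or routing rules instead, and the three tools are invented.

```python
import re

def estimate_overhead(tool_count, low=100, high=300):
    """Token range consumed by tool definitions alone."""
    return tool_count * low, tool_count * high

def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_tools(tools, user_message, limit=5):
    """Keep the `limit` tools whose name and description best overlap the message."""
    query = _words(user_message)

    def score(tool):
        return len(query & _words(tool["name"] + " " + tool["description"]))

    # sorted() is stable, so ties keep their original order.
    return sorted(tools, key=score, reverse=True)[:limit]

ALL_TOOLS = [
    {"name": "send_email", "description": "Send an email to a recipient"},
    {"name": "search_orders", "description": "Find customer orders by order id"},
    {"name": "get_weather", "description": "Get current weather for a city"},
]

print(estimate_overhead(20))  # (2000, 6000)
chosen = select_tools(ALL_TOOLS, "What's the weather in Tokyo?", limit=1)
print([t["name"] for t in chosen])  # ['get_weather']
```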
5. Implement retry with fallback
Tool calls can fail: the external API times out, the database is down, rate limits kick in. Retry transient failures a small, bounded number of times; if the call still fails, return a clear error message to the model instead of throwing an exception. The model can often recover gracefully: "The weather API is currently unavailable. Based on historical data for Tokyo in late May, temperatures typically range from 65-78°F."
Function Calling vs. MCP: When to Use What
Quick decision guide
- Use direct function calling when you have fewer than 10 tools, a single LLM provider, and a simple request-response pattern. It's simpler to set up and debug.
- Use MCP when you need tool portability across providers, have a large or dynamic tool catalog, or are building multi-agent systems where agents need to discover and share tools. MCP adds complexity upfront but pays off as the system scales.
Building Your First Agent with Tools
Here's the mental model for building a useful agent with function calling:
- Start with one tool. Pick the simplest, most useful tool for your use case. Get it working end-to-end before adding more.
- Build the loop. The agent needs a loop: receive user message → call model → if tool call, execute and loop → if text, return to user. Most frameworks handle this, but it's worth building manually once to understand the mechanics.
- Add error boundaries. Set a maximum number of tool calls per turn (typically 5-10) to prevent infinite loops. Implement timeouts on tool execution. Return structured errors the model can understand.
- Instrument everything. Log every tool call: which tool, what arguments, execution time, result. This is your debugging lifeline when the agent behaves unexpectedly.
- Test with adversarial inputs. Users will ask the agent to do things your tools can't handle. Test edge cases: empty inputs, invalid cities, SQL injection attempts in search queries. The model is generally good at handling these, but your tool implementations need to be robust.
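The loop, the call limit, and the logging described in the steps above can be sketched in a few dozen lines. To keep it runnable offline, `fake_model` is a scripted stand-in for a real LLM call, and the message and reply shapes are simplified, provider-neutral dicts invented for this example.

```python
import json
import time

def get_weather(city):
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Scripted stand-in for an LLM: requests weather once, then answers."""
    last = messages[-1]
    if last["role"] == "user":
        return {"type": "tool_call", "name": "get_weather", "arguments": {"city": "Tokyo"}}
    data = json.loads(last["content"])
    return {"type": "text", "content": f"It is {data['temp_c']} C in {data['city']}."}

def run_agent(user_message, model, max_tool_calls=5):
    """Call the model, run any requested tool, repeat until text or the limit."""
    messages = [{"role": "user", "content": user_message}]
    log = []
    for _ in range(max_tool_calls):
        reply = model(messages)
        if reply["type"] == "text":
            return reply["content"], log
        started = time.perf_counter()
        try:
            output = TOOLS[reply["name"]](**reply["arguments"])
        except Exception as exc:
            output = {"error": f"{type(exc).__name__}: {exc}"}  # structured error
        log.append({
            "tool": reply["name"],
            "arguments": reply["arguments"],
            "seconds": time.perf_counter() - started,
            "output": output,
        })
        messages.append({"role": "tool", "content": json.dumps(output)})
    return "Stopped: tool-call limit reached.", log

answer, call_log = run_agent("What's the weather in Tokyo?", fake_model)
print(answer)
```

Swapping `fake_model` for a function that calls a real provider and translates its response into the same two reply shapes leaves the loop unchanged.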
The Job Market for Tool-Use Skills
Function calling and tool use are now table-stakes skills for AI engineering roles. Every major AI company — from OpenAI and Anthropic to startups building on top of their APIs — requires engineers who can build reliable tool-calling systems.
Roles that specifically require these skills include AI Engineer, ML Platform Engineer, AI Application Developer, and the increasingly common "AI Agent Engineer" title. Compensation for these roles ranges from $180K to $450K+ total comp depending on level and company.