Tool Use & Function Calling: Giving LLMs Hands
You build a customer support chatbot. A user asks “What is my order status?” The chatbot responds: “I’d be happy to help you check your order status! Please provide your order number and I’ll look that up for you.” The user provides the order number. The chatbot says: “Thank you! Your order status should be available in your account dashboard.”
It never actually checked anything. It generated helpful-sounding text without doing anything useful. Now imagine the same chatbot with function calling: the user asks about their order, the model calls get_order_status(order_id="ORD-12345"), receives the actual status from your backend, and responds: “Your order ORD-12345 shipped yesterday via FedEx. Tracking number: 7891234. Expected delivery: Thursday.”
Function calling is the mechanism that transforms LLMs from text generators into systems that can interact with the real world. It is the difference between a model that talks about doing things and one that actually does them.
What function calling actually is
Function calling (also called tool use) is a protocol where:
- You define available functions with their names, descriptions, and parameter schemas
- The model decides when to call a function (based on the user’s request)
- The model generates a structured function call (name + arguments as JSON)
- Your application executes the function and returns the result
- The model incorporates the result into its response
The model never executes functions itself - it generates the intent to call them. Your code handles execution. This is a critical safety property: you control what actually happens.
graph TD
U["User: 'What's the weather in Tokyo?'"] --> M["Model reasons:
need weather data → call get_weather"]
M --> FC["Function Call:
get_weather(location='Tokyo')"]
FC --> APP["Your Application:
executes the function"]
APP --> API["Weather API:
returns data"]
API --> RES["Result: {temp: 22, condition: 'sunny'}"]
RES --> M2["Model generates response:
'It's 22°C and sunny in Tokyo'"]
style U fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style FC fill:#E1F5EE,stroke:#0F6E56,color:#085041
style APP fill:#FAEEDA,stroke:#854F0B,color:#633806
style M2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
How function calling works with APIs
Defining tools
tools = [
{
"type": "function",
"function": {
"name": "get_order_status",
"description": "Look up the current status of a customer order by order ID",
"parameters": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"description": "The order ID (format: ORD-XXXXX)"
}
},
"required": ["order_id"]
}
}
},
{
"type": "function",
"function": {
"name": "initiate_refund",
"description": "Start the refund process for an order. Only call after confirming with the customer.",
"parameters": {
"type": "object",
"properties": {
"order_id": {"type": "string"},
"reason": {"type": "string", "enum": ["damaged", "wrong_item", "not_delivered", "changed_mind"]},
"amount_cents": {"type": "integer"}
},
"required": ["order_id", "reason"]
}
}
}
]
The execution flow
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
)
# Model decided to call a function
if response.choices[0].finish_reason == "tool_calls":
tool_call = response.choices[0].message.tool_calls[0]
# Execute the function
if tool_call.function.name == "get_order_status":
args = json.loads(tool_call.function.arguments)
result = order_service.get_status(args["order_id"])
# Send result back to model
messages.append(response.choices[0].message)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
# Get final response
final = client.chat.completions.create(model="gpt-4o", messages=messages)
Parallel and sequential tool calls
Parallel calls
When the model needs multiple independent pieces of information:
User: "What's the weather in Tokyo and New York?"
Model generates TWO tool calls simultaneously:
- get_weather(location="Tokyo")
- get_weather(location="New York")
Execute both in parallel, return both results, and the model synthesizes them into one response.
Sequential calls
When the second call depends on the first:
User: "Find my most recent order and check its status"
Step 1: get_recent_orders(user_id="usr_123")
→ Returns: [{"order_id": "ORD-789", ...}]
Step 2: get_order_status(order_id="ORD-789")
→ Returns: {"status": "shipped", ...}
The model cannot make the second call until it sees the result of the first.
graph LR
subgraph parallel["Parallel Execution"]
P1["Call A"] --> PR["Both results"]
P2["Call B"] --> PR
PR --> RESP["Single response"]
end
subgraph sequential["Sequential Execution"]
S1["Call A"] --> S2["Result A"]
S2 --> S3["Call B (uses A's result)"]
S3 --> S4["Result B"]
S4 --> RESP2["Final response"]
end
style P1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style P2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style S1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style S3 fill:#FAEEDA,stroke:#854F0B,color:#633806
Designing effective tool interfaces
Rule 1: Clear, specific descriptions
The model selects tools based on descriptions. Vague descriptions cause wrong tool selection:
# BAD: too vague
{"name": "search", "description": "Search for things"}
# GOOD: specific about what and when
{"name": "search_product_catalog",
"description": "Search the product catalog by name, category, or SKU. Use when the user asks about product availability, pricing, or specifications."}
Rule 2: Constrained parameters
Use enums, patterns, and required fields to reduce errors:
# BAD: freeform string
{"type": "string", "description": "The priority level"}
# GOOD: constrained enum
{"type": "string", "enum": ["low", "medium", "high", "critical"]}
Rule 3: Return structured, minimal results
Tool results should be concise and structured. Do not return entire HTML pages or 10KB JSON blobs:
# BAD: returns raw API response with 50 fields
return raw_api_response
# GOOD: returns only what the model needs
return {
"status": "shipped",
"tracking_number": "FX123456",
"estimated_delivery": "2024-03-15",
"carrier": "FedEx"
}
Rule 4: Include error handling in descriptions
Tell the model what happens on failure:
{
"name": "get_order_status",
"description": "Look up order status. Returns error if order_id is invalid or not found. In that case, ask the user to verify their order number."
}
Rule 5: Separate read from write operations
Keep query tools (safe, repeatable) separate from mutation tools (dangerous, irreversible):
# Read: safe to call anytime
"get_order_status" # No side effects
"search_products" # No side effects
# Write: require confirmation
"initiate_refund" # Moves money
"cancel_order" # Irreversible
"send_email" # External communication
Where function calling breaks
Hallucinated function calls
The model invents a function that does not exist or passes parameters that do not match the schema. Modern models with structured tool calling (OpenAI, Anthropic) rarely do this, but it still happens with weaker models or complex schemas.
Wrong tool selection
Given 20+ tools, the model picks the wrong one. A user says “check my balance” and the model calls get_account_settings instead of get_account_balance because the descriptions are not distinct enough.
Parameter extraction failures
The model misinterprets the user’s input when filling parameters:
- User says “last Tuesday” → model needs to compute the actual date
- User gives a partial order number → model guesses the rest
- User provides ambiguous input → model picks the wrong interpretation
Over-eagerness
The model calls tools when it should ask for clarification: “Refund my order” without specifying which order triggers the model to guess rather than ask.
Under-eagerness
The model answers from memory when it should call a tool: “What’s your return policy?” gets a hallucinated answer instead of calling get_policy("returns").
Real-world function calling systems
- OpenAI GPT-4 - native function calling with JSON schema validation, parallel tool calls
- Anthropic Claude - tool use protocol with explicit thinking before tool calls
- Google Gemini - function calling with automatic parameter extraction
- ChatGPT Plugins (deprecated) → GPTs - tool use connecting ChatGPT to third-party APIs
- Stripe AI - agents that call Stripe APIs to manage payments, subscriptions, and disputes
How to apply in practice
Start with 3-5 tools and expand. Validate the model reliably selects and uses a small set before adding more. Each new tool increases selection ambiguity.
Validate tool call arguments before execution. Do not trust the model’s generated parameters. Validate types, ranges, permissions, and business logic before calling your backend.
Implement confirmation for destructive actions. Before executing delete_account() or send_payment(), return a confirmation message to the user and require explicit approval.
Monitor tool call patterns. Track which tools are called, how often they fail, and what errors occur. This telemetry reveals which tool descriptions need improvement and where the model is confused.
Use function calling for structured extraction too. Even when you do not have a “real” tool to call, defining a function schema is the most reliable way to extract structured data:
# "Tool" that just structures the extraction
{"name": "extract_contact_info",
"parameters": {"name": "string", "email": "string", "phone": "string"}}
FAQ
Q: How many tools can I give a model before performance degrades?
Empirically, most models handle 10-20 tools well. Beyond 20, tool selection accuracy drops and the model may ignore tools it should use or select wrong ones. If you need more tools, implement tool routing: classify the user’s intent first, then provide only the relevant 5-10 tools for that intent category. Some models (GPT-4, Claude) handle more tools than others.
Q: Should I let the model decide when to use tools, or should my code force tool usage?
Let the model decide for conversational interfaces where the user might ask things that do not require tools. Force tool usage (via tool_choice parameter) when you know the user’s intent always requires a specific tool call - like a “check status” button that should always call get_status. Forcing eliminates the “under-eager” failure mode.
Q: What is the difference between function calling and MCP (Model Context Protocol)?
Function calling is a model-level protocol: you tell the model what tools exist, the model generates structured calls, you execute them. MCP is a system-level protocol: it standardizes how AI applications discover, connect to, and use tool servers. Think of function calling as the “calling convention” and MCP as the “service discovery and transport layer.” MCP enables tools to be provided by external servers rather than hardcoded in your application.
Interview questions
Q: Design the tool interface for an AI assistant that helps manage a team’s project board (create tasks, assign, update status, comment). What tools would you define and what safety measures would you implement?
Tools: (1) list_tasks(filters: status, assignee, priority), (2) get_task(task_id), (3) create_task(title, description, assignee, priority), (4) update_task_status(task_id, new_status), (5) add_comment(task_id, comment), (6) assign_task(task_id, assignee). Safety: require confirmation for create/update/assign actions (show what will change, ask “proceed?”). Validate assignee exists in the team. Prevent status transitions that violate workflow (cannot go from “done” back to “todo” without explanation). Rate limit: max 5 mutations per conversation to prevent runaway agents. Permissions: check that the user has permission to modify the board before exposing mutation tools.
Q: Your function calling agent calls the wrong tool 15% of the time. Users ask about “billing” and it calls get_usage_stats instead of get_billing_info. How do you diagnose and fix?
Diagnosis: log all tool selections with the user query and compare actual vs expected tool. Look for patterns in mis-selections - likely the tool descriptions are too similar or too vague. Fixes: (1) Rewrite descriptions to be more distinct - explicitly state “Do NOT use this for billing questions” in the usage stats description. (2) Add “when to use” and “when NOT to use” guidance in descriptions. (3) Add few-shot examples in the system prompt showing correct tool selection for common queries. (4) If tools are genuinely confusable, merge them into one tool with a mode parameter. (5) Add query classification: detect “billing” intent before the LLM call and pre-filter tools to only billing-related ones. Monitor: track selection accuracy per tool and per query category to measure improvement.