AI Engineering
Tool Calling / Function Calling: LLMs Controlling External Systems
Lesson Objectives
- Understand the tool calling mechanism: how LLMs generate function call requests instead of text
- Learn to describe tools via JSON Schema with effective descriptions
- Implement the full tool calling cycle: request → tool_call → execute → respond
- Handle parallel tool calls via Promise.all for optimal performance
- Build a production-ready handler with validation, timeouts, and rate limiting
Prerequisites
- Structured Output and JSON Schema
- OpenAI Chat Completions API
Tool calling is the moment when an LLM stops being a text generator and becomes an agent. The model does not execute code. It says: "call this with these arguments." The distinction is central: the entire security model lives there. In June 2023 OpenAI launched function calling. Schick et al. (Toolformer, 2023) had just shown that models can learn to use tools through self-supervised training, but without a standard API it remained a lab result. Function calling made it a product, and within a year it was the industry standard.
- ChatGPT Plugins and GPT Actions all work through tool calling: from restaurant reservations to flight bookings - every action is a JSON request from the model
- GitHub Copilot Workspace calls tools to edit files, run tests, create PRs - tool calling as the foundation of an agentic interface
- Stripe AI Assistant calls 50+ internal APIs via function calling - processes refunds, checks payments, generates reports
- Cursor IDE - all access to the codebase, file system, and terminal goes through tool calling: the model has no direct access, only through explicit calls
From Descriptions to Actions
- **February 2023**: Schick et al. publish Toolformer, the first paper on self-supervised training of LLMs to use tools. The model learns on its own when and how to call an API, a calculator, a search engine. Academically convincing, but without a standardized interface.
- **June 2023**: OpenAI launches function calling, the first mass-market API for tool use. Overnight, hundreds of teams gain the ability to connect an LLM to real systems.
- **Late 2023**: Anthropic ships tool use, Google ships function calling. Parallel function calling arrives with OpenAI's November 2023 models (GPT-4 Turbo).
- **2024**: tool_choice, structured tool outputs, Anthropic's MCP (Model Context Protocol) - standardization at the protocol level.
What Is Tool Calling: LLM as Dispatcher
Tool calling is the moment when an LLM stops being a text generator and becomes an agent. Before June 2023, GPT-4 could only describe actions: "I would check the order status." A Shopify developer connected the model to a CRM - the bot handled 12,000 requests in a week and completed none of them. Text descriptions without execution - that is exactly what an LLM is without tool calling.
The model does not execute the function itself. It says: "call this with these arguments." The distinction is fundamental - and the entire security model lives there. Execution stays on the backend side: check permissions, validate arguments, log, apply rate limits. The LLM only generates JSON with the function name and its parameters - a structured request instead of text.
| Without Tool Calling | With Tool Calling |
|---|---|
| Model generates text "I would check..." | Model returns JSON: { name: 'check_order', arguments: {...} } |
| Backend can't parse the intent | Backend receives a typed function call |
| The user performs the action themselves | The system automatically performs the action |
| Hallucinations in the response format | JSON Schema constrains the format (guaranteed only in strict mode) |
June 2023 - OpenAI launches function calling. Before that, researchers were experimenting: Schick et al. 2023 (Toolformer) showed that models could learn to use tools via self-supervised training - but without a standardized API, it remained a lab result. OpenAI function calling was the first mass-market interface. Anthropic followed with tool use, Google with function calling. Within a year it became the industry standard.
Myth: The LLM calls the function directly - like invoking a method in code
Reality: The model only generates JSON: function name + arguments. No execution. The backend decides whether to run it at all
This is not just semantics - it is an architectural security boundary. The model has no access to the runtime environment. It does not know whether `delete_account` exists in production, whether the user has permissions, or whether the rate limit has been reached. It sees the JSON Schema with descriptions and decides: "this is what should be called." The backend receives that JSON, validates it, checks permissions, logs - and only then executes. That is exactly why tool calling is safer than letting the model execute code directly: every call passes through explicit backend code.
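The boundary described above can be sketched as a small dispatcher. This is a minimal illustration, not a reference implementation: `registry`, `Caller`, `auditLog`, and `dispatch` are hypothetical names, and the role model is an assumption. The `ToolCall` shape follows the Chat Completions convention in which `function.arguments` arrives as a JSON string.

```typescript
// Shape of a tool call as emitted by the model (Chat Completions style).
// `arguments` is a JSON *string*, not a parsed object.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

// Hypothetical caller context and registry entry, for illustration only.
interface Caller {
  userId: string;
  roles: string[];
}

interface RegisteredTool {
  requiredRole: string;
  run: (args: Record<string, unknown>) => string;
}

// Only tools registered here can ever run, whatever name the model emits.
const registry = new Map<string, RegisteredTool>([
  [
    "check_order",
    {
      requiredRole: "customer",
      run: (args) => JSON.stringify({ order_id: args.order_id, status: "shipped" }),
    },
  ],
]);

const auditLog: string[] = [];

function dispatch(call: ToolCall, caller: Caller): string {
  // 1. Allowlist: unknown names are rejected before anything else happens.
  const tool = registry.get(call.function.name);
  if (!tool) return JSON.stringify({ error: `Unknown tool: ${call.function.name}` });

  // 2. Permissions: the model never sees or decides this.
  if (!caller.roles.includes(tool.requiredRole)) {
    return JSON.stringify({ error: "Permission denied" });
  }

  // 3. Arguments are untrusted text until parsed and checked.
  let args: Record<string, unknown>;
  try {
    const parsed: unknown = JSON.parse(call.function.arguments);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      throw new Error("not an object");
    }
    args = parsed as Record<string, unknown>;
  } catch {
    return JSON.stringify({ error: "Arguments must be a JSON object" });
  }

  // 4. Log, then execute.
  auditLog.push(`${caller.userId} -> ${call.function.name}`);
  return tool.run(args);
}
```

Every rejection is returned as data rather than thrown, so the caller can pass it back to the model as a tool result.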
What does the LLM return during tool calling instead of a regular text response?
Defining Tools: JSON Schema and Descriptions
A tool description is a prompt. Not technical documentation, not a code comment. A prompt the model uses to decide: should this function be called, with what arguments, in what context. A Microsoft study (2024) showed that one additional example sentence in a description raises call accuracy from 73% to 91%. An 18-percentage-point gap - from a single sentence.
Rules for Writing Effective Descriptions
- **name** - verb + noun: `search_products`, `create_order`, `get_user_balance`. Not `do_thing` or `handler`
- **function description** - 2-3 sentences: what it does, what it returns, when to use. The description is a prompt for the model
- **parameter descriptions** - include example values: 'e.g. "red sneakers"'. For enum - describe when to choose each value
- **required** - only truly required parameters. Optional parameters with "when to specify" descriptions improve accuracy
- **enum instead of string** - if the set of values is limited, use enum. This eliminates hallucinations
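The rules above can be applied to a single definition as follows. This is a sketch in the Chat Completions `tools` format; `search_products`, its parameters, and the enum values are illustrative and not taken from a real catalog API.

```typescript
// A tool definition in the Chat Completions `tools` format.
// The tool itself and all field names are hypothetical.
const searchProductsTool = {
  type: "function",
  function: {
    name: "search_products", // verb + noun
    description:
      "Search the product catalog by free-text query. " +
      "Returns up to `limit` products with id, title, and price. " +
      "Use when the user asks to find, compare, or price an item.",
    parameters: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: 'Search phrase in the user\'s own words, e.g. "red sneakers".',
        },
        sort_by: {
          type: "string",
          // enum instead of a free string: the model cannot invent a value
          enum: ["relevance", "price_asc", "price_desc"],
          description:
            "relevance: default choice. price_asc: the user asks for the cheapest. " +
            "price_desc: the user asks for premium or most expensive items.",
        },
        limit: {
          type: "integer",
          minimum: 1,
          maximum: 20,
          description: "Specify only when the user asks for a number of results, e.g. 5.",
        },
      },
      required: ["query"], // only what is truly required
      additionalProperties: false,
    },
  },
} as const;
```

The optional parameters carry "when to specify" guidance in their descriptions, so the model has a reason to leave them out.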
Don't add more than 20 tools to a single request. Research showed that accuracy falls as the tool list grows: 97% with 5 tools, 88% with 20, 64% with 50. If many tools are needed, group them and use a two-stage approach: first the model picks a category, then a specific tool from that category.
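One possible shape for the two-stage approach is sketched below. Everything here is an assumption made for illustration: the category names, the `select_category` routing tool, and the helper functions are hypothetical.

```typescript
// Hypothetical catalog: category -> tool names.
const toolCatalog = new Map<string, string[]>([
  ["orders", ["get_order", "create_return", "cancel_order"]],
  ["billing", ["get_user_balance", "list_invoices"]],
  ["catalog", ["search_products", "get_product"]],
]);

const MAX_TOOLS_PER_REQUEST = 20;

// Stage 1: expose a single routing tool whose only argument is an enum of categories.
function buildCategoryRouter() {
  return {
    type: "function" as const,
    function: {
      name: "select_category",
      description: "Pick the one category that matches the user's request.",
      parameters: {
        type: "object",
        properties: {
          category: { type: "string", enum: Array.from(toolCatalog.keys()) },
        },
        required: ["category"],
      },
    },
  };
}

// Stage 2: expose only the tools of the category the model picked.
function toolsForCategory(category: string): string[] {
  const names = toolCatalog.get(category);
  if (!names) throw new Error(`Unknown category: ${category}`);
  if (names.length > MAX_TOOLS_PER_REQUEST) {
    throw new Error(`Category ${category} exceeds ${MAX_TOOLS_PER_REQUEST} tools; split it further`);
  }
  return names;
}
```

The trade-off is one extra model request per task in exchange for a much shorter tool list in each request.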
Which description for the function get_user_balance is most effective?
The Tool Calling Cycle: request → tool_call → execute → respond
The most common mistake when first encountering tool calling: a developer receives a tool_call from the model, assumes the job is done, and writes a response to the user. But the model has not yet seen the function result. It said "call get_weather with London" - but has no idea what the call returned. Tool calling is a cycle of at least two LLM requests.
Full Cycle Diagram
Note the `tool_call_id` - it is a required field. Each tool result must be linked to a specific tool_call via its id. Without this, the API returns an error. With parallel tool calls (multiple calls at once), each result must have its own unique tool_call_id.
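The cycle can be sketched end to end as follows. To keep the example runnable without network access, the model is an injected function: `ModelFn`, `fakeModel`, `getWeather`, and `runToolCycle` are hypothetical names, and `fakeModel` is a scripted stand-in for a real chat completion request. The message shapes (`tool_calls`, `role: "tool"`, `tool_call_id`) follow the Chat Completions format.

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface AssistantMessage {
  role: "assistant";
  content: string | null;
  tool_calls?: ToolCall[];
}

type ChatMessage =
  | { role: "user"; content: string }
  | AssistantMessage
  | { role: "tool"; tool_call_id: string; content: string };

// Stand-in for one chat completion request (assumption: injected by the caller).
type ModelFn = (messages: ChatMessage[]) => Promise<AssistantMessage>;

// Hypothetical local tool implementation.
function getWeather(args: { city: string }): string {
  return JSON.stringify({ city: args.city, temp_c: 14, condition: "cloudy" });
}

async function runToolCycle(callModel: ModelFn, question: string): Promise<string> {
  const messages: ChatMessage[] = [{ role: "user", content: question }];

  // Request 1: the model answers with tool_calls instead of text.
  const first = await callModel(messages);
  messages.push(first);
  const toolCalls = first.tool_calls ?? [];
  if (toolCalls.length === 0) return first.content ?? "";

  // Execute on the backend and link every result to its call via tool_call_id.
  for (const call of toolCalls) {
    const result =
      call.function.name === "get_weather"
        ? getWeather(JSON.parse(call.function.arguments))
        : JSON.stringify({ error: `Unknown tool: ${call.function.name}` });
    messages.push({ role: "tool", tool_call_id: call.id, content: result });
  }

  // Request 2: the model now sees the tool output and writes the final answer.
  const second = await callModel(messages);
  return second.content ?? "";
}

// Scripted model: asks for get_weather first, then summarizes the tool result.
const fakeModel: ModelFn = async (messages) => {
  const last = messages[messages.length - 1];
  if (last.role === "tool") {
    const data = JSON.parse(last.content);
    return { role: "assistant", content: `It is ${data.temp_c} C and ${data.condition} in ${data.city}.` };
  }
  return {
    role: "assistant",
    content: null,
    tool_calls: [
      { id: "call_1", type: "function", function: { name: "get_weather", arguments: '{"city":"London"}' } },
    ],
  };
};
```

Calling `runToolCycle(fakeModel, "What's the weather in London?")` makes exactly two model requests: one that yields the tool call and one that yields the answer.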
What is the MINIMUM number of LLM API requests required for a complete tool calling cycle (from user question to final answer)?
Parallel Tool Calls: Multiple Functions in One Request
"Compare the weather in Moscow and London" - a reasonable request. The naive implementation: call get_weather twice sequentially, two round-trips to the LLM. But the model is smarter: it returns **both tool_calls in a single response** - parallel tool calls. Parallel function calling shipped with OpenAI's November 2023 models (GPT-4 Turbo) and cuts average latency by 35-40% for multi-tool requests.
The `parallel_tool_calls` API parameter controls behavior: `true` (default) - the model can return multiple tool_calls, `false` - strictly one at a time. Disabling is useful when tools have dependencies: the result of the first is needed for the second.
| Scenario | parallel_tool_calls | Why |
|---|---|---|
| Weather in 3 cities simultaneously | true | Requests are independent - can execute in parallel |
| Find user → get their orders | false | The second call depends on the result of the first |
| Search products + check balance | true | Independent requests to different systems |
| Create order → send notification | false | Notification is needed only after successful order creation |
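For the independent rows in the table, the backend can run all returned calls concurrently. The sketch below assumes two hypothetical tools with simulated latency; `toolImpls`, `executeOne`, and `executeParallel` are illustrative names. Each call catches its own failure, because a single rejected promise would otherwise make `Promise.all` reject the whole batch.

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}

type ToolImpl = (args: Record<string, unknown>) => Promise<string>;

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical tools, each simulating about 50 ms of I/O.
const toolImpls = new Map<string, ToolImpl>([
  ["get_weather", async (args) => {
    await delay(50);
    return JSON.stringify({ city: args.city, temp_c: 3 });
  }],
  ["search_products", async (args) => {
    await delay(50);
    return JSON.stringify([{ title: "Umbrella", price: 19.9, query: args.query }]);
  }],
]);

async function executeOne(call: ToolCall): Promise<ToolMessage> {
  let content: string;
  try {
    const impl = toolImpls.get(call.function.name);
    if (!impl) throw new Error(`Unknown tool: ${call.function.name}`);
    content = await impl(JSON.parse(call.function.arguments));
  } catch (err) {
    // A failure becomes an error payload so one bad call cannot sink the batch.
    content = JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
  }
  // Every result carries the id of the call that produced it.
  return { role: "tool", tool_call_id: call.id, content };
}

// Promise.all starts all calls at once and preserves input order in its result.
function executeParallel(calls: ToolCall[]): Promise<ToolMessage[]> {
  return Promise.all(calls.map(executeOne));
}
```

With this shape the batch takes roughly as long as the slowest tool rather than the sum of all tools. For the dependent rows, the calls would instead be awaited one after another.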
The user asks: "What's the weather in Moscow and how much does an umbrella cost in the catalog?" The model returned 2 tool_calls: get_weather and search_products. How should the backend handle them?
Error Handling and Argument Validation
Phantom function - a real production failure class. GPT-4 called `delete_user_account` instead of `get_user_account` - both end with `_user_account`, and description-based routing failed. In the 2 hours before detection, 37 accounts were deleted. Tool calling is only as safe as the code around it. The model generates JSON - the backend decides whether to execute.
Protection Against Dangerous Calls
| Protection | What it prevents | Implementation |
|---|---|---|
| Function name validation | Calling a nonexistent or dangerous function | Whitelist of allowed names |
| Zod argument validation | SQL injection, invalid data | Typed schemas with constraints |
| Execution timeout | Hanging on a slow API | Promise.race with timeout |
| Per-tool rate limiting | DDoS via the bot ("call search 1000 times") | Call counter per tool per session |
| Confirmation for dangerous tools | Accidental deletion/modification of data | Pause before create_return, delete_account |
| Iteration limit | Infinite tool call loop | MAX_TOOL_ITERATIONS = 5 |
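Three rows of the table (execution timeout, per-tool rate limiting, iteration limit) can be sketched as small helpers. The names `withTimeout`, `ToolRateLimiter`, and `runWithIterationLimit` are hypothetical, and the limiter keeps its counters in memory, which is an assumption that only holds for a single process.

```typescript
const MAX_TOOL_ITERATIONS = 5;

// Rejects if `work` does not settle within `ms`.
// Promise.race only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Call counter per tool per session.
class ToolRateLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxCallsPerTool: number;

  constructor(maxCallsPerTool: number) {
    this.maxCallsPerTool = maxCallsPerTool;
  }

  // Returns false once the tool has used up its budget for this session.
  tryAcquire(sessionId: string, toolName: string): boolean {
    const key = `${sessionId}:${toolName}`;
    const used = this.counts.get(key) ?? 0;
    if (used >= this.maxCallsPerTool) return false;
    this.counts.set(key, used + 1);
    return true;
  }
}

// Runs `step` until it reports completion or the iteration budget is spent.
// `step` stands for one "model request + tool execution" round.
async function runWithIterationLimit(
  step: (iteration: number) => Promise<boolean>
): Promise<number> {
  for (let i = 1; i <= MAX_TOOL_ITERATIONS; i++) {
    if (await step(i)) return i;
  }
  throw new Error(`Stopped after ${MAX_TOOL_ITERATIONS} tool iterations`);
}
```

A tool execution would typically be wrapped as `withTimeout(runTool(args), 5000)` after `tryAcquire` succeeds; `runTool` and the 5-second budget are placeholders.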
When an error comes back in the tool result, the model usually tries to fix the call. For example, with an invalid order_id, the model will ask the user to clarify the number. This is **desired behavior** - do not throw exceptions, return error descriptions in the tool result's content field.
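A sketch of that pattern for `create_return` follows. The lesson names Zod for argument validation; to stay dependency-free this example uses a hand-written check instead, which a Zod schema could replace. The order id format, the list of reasons, and the function names are assumptions made for illustration.

```typescript
interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}

// Hypothetical constraints: order ids look like "ORD-10482".
const ORDER_ID_PATTERN = /^ORD-\d{5}$/;
const RETURN_REASONS = ["damaged", "wrong_item", "changed_mind"];

type Validation =
  | { ok: true; value: { order_id: string; reason: string } }
  | { ok: false; errors: string[] };

function validateCreateReturn(rawArguments: string): Validation {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return { ok: false, errors: ["arguments must be valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, errors: ["arguments must be a JSON object"] };
  }

  const { order_id, reason } = parsed as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof order_id !== "string" || !ORDER_ID_PATTERN.test(order_id)) {
    errors.push('order_id must be "ORD-" followed by 5 digits, e.g. "ORD-10482"');
  }
  if (typeof reason !== "string" || !RETURN_REASONS.includes(reason)) {
    errors.push(`reason must be one of: ${RETURN_REASONS.join(", ")}`);
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { order_id: order_id as string, reason: reason as string } };
}

// Never throws: validation failures go back to the model as tool content,
// so it can correct the call or ask the user for the missing detail.
function handleCreateReturn(toolCallId: string, rawArguments: string): ToolMessage {
  const check = validateCreateReturn(rawArguments);
  const payload = check.ok
    ? { status: "return_created", order_id: check.value.order_id }
    : { error: "invalid_arguments", details: check.errors };
  return { role: "tool", tool_call_id: toolCallId, content: JSON.stringify(payload) };
}
```

The error messages are written for the model as a reader: they name the field and show a valid example, which gives the next request something concrete to fix.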
Myth: If a tool is described in the schema, the LLM will call it correctly. Validation and error handling are the client code's job, not the model's.
Reality: LLMs routinely produce invalid arguments, confuse similar functions, and skip required fields. Production systems must validate every tool call before execution and return structured errors in a loop so the model can self-correct.
Why the myth persists: A schema is read as a contract the LLM is obliged to honour - an intuition imported from strictly typed APIs. In reality the LLM emits JSON as text, and phantom functions, missing required fields and type mismatches occur in 5-15% of calls even on GPT-4. Without an error loop those failures turn into production incidents.
The LLM called create_return with an invalid order_id. How should the backend respond?
Key Takeaways
- Tool calling is the boundary between LLM-as-text and LLM-as-agent: the model generates a JSON request, the backend executes
- Tool description is a prompt: one example sentence raises accuracy from 73% to 91%
- The cycle requires at least 2 requests: the first returns the tool_call, the second (with the result) returns the final answer
- Parallel tool calls via Promise.all: 35-40% lower latency for independent calls
- Production stack: Zod validation + whitelist + timeouts + rate limits + MAX_TOOL_ITERATIONS = 5
Questions for Reflection
- Why should the backend decide whether to execute a tool call? What would change if the model could execute functions directly?
- A tool description is a prompt for the model. How does that change the approach to writing descriptions compared to regular documentation?
- Parallel tool calls reduce latency. But in which scenarios is call order critical - and how does that affect architecture?
What's Next
Tool calling is a single function call. But what if a task requires a chain of calls with reasoning in between? That is an AI agent - a system that plans, reasons, and acts in a loop.
- Agent Fundamentals — Tool calling in a reasoning loop - ReAct pattern, planning, memory for agents
- Agent Frameworks — LangGraph, CrewAI, AutoGen - frameworks that abstract tool calling into agent systems
- Error Handling for LLMs — Deeper into error handling: retries, fallbacks, graceful degradation for LLM systems
Related Lessons
- aie-07-structured-output — Tool call is structured output to JSON schema
- aie-17-agent-fundamentals — Tool calling is the atomic primitive of agents
- aie-15-conversation-memory — Agent with memory needs both memory + tools
- aie-19-multi-agent — Multi-agent: agents call each other as tools
- bt-09-grpc — Tool schema is IDL like gRPC Protobuf
- st-01-feedback-loops — ReAct observe-think-act is a classic feedback loop
- net-64-api-gateway