AI Engineering

Tool Calling / Function Calling: LLMs Controlling External Systems

Lesson Objectives

  • Understand the tool calling mechanism: how LLMs generate function call requests instead of text
  • Learn to describe tools via JSON Schema with effective descriptions
  • Implement the full tool calling cycle: request → tool_call → execute → respond
  • Handle parallel tool calls via Promise.all for optimal performance
  • Build a production-ready handler with validation, timeouts, and rate limiting

Prerequisites

  • Structured Output and JSON Schema
  • OpenAI Chat Completions API
  • Structured Output
  • API Integration

Tool calling is the moment when an LLM stops being a text generator and becomes an agent. The model does not execute code. It says: "call this with these arguments." The distinction is central, and the entire security model lives there. In June 2023 OpenAI launched function calling. A few months earlier, Schick et al. (Toolformer, 2023) had shown that models can learn to use tools through self-supervised training, but without a standard API it remained a lab result. Function calling made it a product, and within a year it was the industry standard.

  • ChatGPT Plugins and GPT Actions all work through tool calling: from restaurant reservations to flight bookings - every action is a JSON request from the model
  • GitHub Copilot Workspace calls tools to edit files, run tests, create PRs - tool calling as the foundation of an agentic interface
  • Stripe AI Assistant calls 50+ internal APIs via function calling - processes refunds, checks payments, generates reports
  • Cursor IDE - all access to the codebase, file system, and terminal goes through tool calling: the model has no direct access, only through explicit calls

From Descriptions to Actions

  • **February 2023**: Schick et al. publish Toolformer, one of the first papers on self-supervised training of LLMs to use tools. The model learns on its own when and how to call an API, a calculator, a search engine. Academically convincing, but without a standardized interface.
  • **June 2023**: OpenAI launches function calling, the first mass-market API for tool use. Overnight, hundreds of teams gain the ability to connect an LLM to real systems.
  • **Late 2023**: Anthropic ships tool use, Google ships function calling. Parallel tool calls arrive with GPT-4 Turbo (November 2023).
  • **2024**: tool_choice, structured tool outputs, Anthropic's MCP (Model Context Protocol) - standardization at the protocol level.

What Is Tool Calling: LLM as Dispatcher

Tool calling is the moment when an LLM stops being a text generator and becomes an agent. Before June 2023, GPT-4 could only describe actions: "I would check the order status." A Shopify developer connected the model to a CRM - the bot handled 12,000 requests in a week and completed none of them. Text descriptions without execution - that is exactly what an LLM is without tool calling.

The model does not execute the function itself. It says: "call this with these arguments." The distinction is fundamental - and the entire security model lives there. Execution stays on the backend side: check permissions, validate arguments, log, apply rate limits. The LLM only generates JSON with the function name and its parameters - a structured request instead of text.

| Without Tool Calling | With Tool Calling |
| --- | --- |
| Model generates text "I would check..." | Model returns JSON: `{ name: 'check_order', arguments: {...} }` |
| Backend can't parse the intent | Backend receives a typed function call |
| The user performs the action themselves | The system automatically performs the action |
| Hallucinations in the response format | Strict JSON Schema guarantees the format |

June 2023 - OpenAI launches function calling. Before that, researchers were experimenting: Schick et al. 2023 (Toolformer) showed that models could learn to use tools via self-supervised training - but without a standardized API, it remained a lab result. OpenAI function calling was the first mass-market interface. Anthropic followed with tool use, Google with function calling. Within a year it became the industry standard.

Myth: the LLM calls the function directly, like invoking a method in code.

Reality: the model only generates JSON: function name + arguments. No execution. The backend decides whether to run it at all.

This is not just semantics - it is an architectural security boundary. The model has no access to the runtime environment. It does not know whether `delete_account` exists in production, whether the user has permissions, or whether the rate limit has been exceeded. It sees the JSON Schema with descriptions and decides: "this is what should be called." The backend receives that JSON, validates it, checks permissions, logs it, and only then executes. That is why tool calling is safer than letting the model execute code directly: every call passes through explicit backend code.
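The boundary described above can be sketched as a small dispatcher. This is a minimal illustration, not a reference implementation: the `ToolCall` shape, the `handlers` and `permissions` tables, and the `customer` role are all assumptions made for the example.

```typescript
// What the model emits: a name plus JSON-encoded arguments. Nothing executable.
type ToolCall = { id: string; name: string; arguments: string };
type ToolHandler = (args: Record<string, unknown>) => string;

// Hypothetical registry: only functions listed here can ever run.
const handlers = new Map<string, ToolHandler>([
  ["check_order", (args) => `Order ${String(args.order_id)} is in transit`],
]);

// Hypothetical permission table keyed by user role.
const permissions = new Map<string, Set<string>>([
  ["customer", new Set(["check_order"])],
]);

function dispatch(call: ToolCall, role: string): string {
  // 1. The function must exist on the backend.
  const handler = handlers.get(call.name);
  if (!handler) return JSON.stringify({ error: `Unknown tool: ${call.name}` });

  // 2. The caller must be allowed to use it.
  if (!permissions.get(role)?.has(call.name)) {
    return JSON.stringify({ error: `Role ${role} may not call ${call.name}` });
  }

  // 3. Arguments arrive as text and must parse to a JSON object.
  let parsed: unknown;
  try {
    parsed = JSON.parse(call.arguments);
  } catch {
    return JSON.stringify({ error: "Arguments are not valid JSON" });
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return JSON.stringify({ error: "Arguments must be a JSON object" });
  }

  // 4. Log, and only then execute.
  console.log(`[audit] ${role} -> ${call.name}`);
  return handler(parsed as Record<string, unknown>);
}
```

A call to a function that is not registered, such as `delete_account` here, never reaches any handler; it comes back as an error string the model can read.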

What does the LLM return during tool calling instead of a regular text response?

Defining Tools: JSON Schema and Descriptions

A tool description is a prompt. Not technical documentation, not a code comment. A prompt the model uses to decide: should this function be called, with what arguments, in what context. A Microsoft study (2024) showed that one additional example sentence in a description raises call accuracy from 73% to 91%. An 18-percentage-point gap - from a single sentence.

Rules for Writing Effective Descriptions

  • **name** - verb + noun: `search_products`, `create_order`, `get_user_balance`. Not `do_thing` or `handler`
  • **function description** - 2-3 sentences: what it does, what it returns, when to use. The description is a prompt for the model
  • **parameter descriptions** - include example values: 'e.g. "red sneakers"'. For enum - describe when to choose each value
  • **required** - only truly required parameters. Optional parameters with "when to specify" descriptions improve accuracy
  • **enum instead of string** - if the set of values is limited, use enum. This eliminates hallucinations
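The rules above can be illustrated with one tool definition in the shape the OpenAI Chat Completions API expects for the `tools` array. The tool itself, its wording, and the `sort` values are assumptions made up for the example.

```typescript
// A hypothetical tool definition following the description rules above.
const searchProductsTool = {
  type: "function",
  function: {
    // verb + noun
    name: "search_products",
    // what it does, what it returns, when to use it
    description:
      "Search the product catalog by a free-text query. " +
      "Returns up to 10 matching products with id, title and price. " +
      "Use when the user asks about product availability or prices.",
    parameters: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: 'Search phrase in the user\'s words, e.g. "red sneakers"',
        },
        sort: {
          // enum instead of a free-form string
          type: "string",
          enum: ["relevance", "price_asc", "price_desc"],
          description:
            'Use "relevance" by default, "price_asc" when the user asks for ' +
            'the cheapest option, "price_desc" for the most expensive.',
        },
      },
      // only what is truly required
      required: ["query"],
      additionalProperties: false,
    },
  },
};
```

Here `sort` stays optional: the description tells the model when to specify it, and the enum rules out invented values.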

Don't add more than 20 tools to a single request. Reported accuracy drops as the tool count grows: 97% with 5 tools, 88% with 20 tools, 64% with 50 tools. If many tools are needed, group them and use a two-stage approach: first the model picks a category, then a specific tool from that category.
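One possible shape for the two-stage approach is sketched below. The catalog, the category names, and the `select_category` routing tool are hypothetical; the only point is that each request exposes a small set of tools.

```typescript
type ToolDef = { name: string; description: string };

// Hypothetical catalog: categories and tools are invented for illustration.
const catalog = new Map<string, ToolDef[]>([
  ["orders", [
    { name: "get_order_status", description: "Look up the current status of an order." },
    { name: "create_return", description: "Open a return request for a delivered order." },
  ]],
  ["billing", [
    { name: "get_user_balance", description: "Return the user's current account balance." },
  ]],
]);

// Stage 1: expose a single routing tool whose enum lists the categories.
function buildRoutingTool() {
  return {
    type: "function",
    function: {
      name: "select_category",
      description: "Pick the category of tools needed to answer the user's request.",
      parameters: {
        type: "object",
        properties: {
          category: { type: "string", enum: Array.from(catalog.keys()) },
        },
        required: ["category"],
      },
    },
  };
}

// Stage 2: expose only the tools of the category the model picked.
function toolsForCategory(category: string): ToolDef[] {
  return catalog.get(category) ?? [];
}
```

The cost is one extra LLM round-trip per request, traded for a shorter tool list at each stage.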

Which description for the function get_user_balance is most effective?

The Tool Calling Cycle: request → tool_call → execute → respond

The most common mistake when first encountering tool calling: a developer receives a tool_call from the model, assumes the work is done, and writes a response to the user. But the model has not yet seen the function result. It said "call get_weather with London" but has no idea what the call returned. Tool calling is a cycle of at least two LLM requests: the first returns the tool_call, the second receives the tool result and produces the final answer.

Full Cycle Diagram

Note the `tool_call_id` - it is a required field. Each tool result must be linked to a specific tool_call via its id. Without this, the API returns an error. With parallel tool calls (multiple calls at once), each result must have its own unique tool_call_id.
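The cycle can be sketched end to end with a stand-in for the model. The message shapes follow the Chat Completions format (`tool_calls` on the assistant message, `tool_call_id` on the tool message); `fakeModel` and `getWeather` are assumptions that replace the real API call and the real weather service so the example runs offline.

```typescript
type ToolCall = { id: string; type: "function"; function: { name: string; arguments: string } };
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string | null; tool_calls?: ToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string };

// Stand-in for the LLM API (an assumption for this sketch).
// Without a tool result in the history it asks for a tool; with one it answers.
async function fakeModel(messages: Message[]): Promise<Message> {
  const toolResult = messages.find((m) => m.role === "tool");
  if (!toolResult) {
    return {
      role: "assistant",
      content: null,
      tool_calls: [
        { id: "call_1", type: "function", function: { name: "get_weather", arguments: '{"city":"London"}' } },
      ],
    };
  }
  return { role: "assistant", content: `Weather report: ${toolResult.content}` };
}

// Hypothetical tool implementation with a hard-coded reading.
function getWeather(args: { city: string }): string {
  return JSON.stringify({ city: args.city, temp_c: 14 });
}

async function runCycle(question: string): Promise<string> {
  const messages: Message[] = [{ role: "user", content: question }];

  // Request 1: the model answers with a tool_call, not with text.
  const first = await fakeModel(messages);
  messages.push(first);

  if (first.role === "assistant" && first.tool_calls) {
    for (const call of first.tool_calls) {
      // Execute on the backend and link the result to the call by its id.
      const result = getWeather(JSON.parse(call.function.arguments));
      messages.push({ role: "tool", tool_call_id: call.id, content: result });
    }
    // Request 2: the model now sees the result and writes the final answer.
    const second = await fakeModel(messages);
    return second.content ?? "";
  }
  return first.content ?? "";
}
```

Replacing `fakeModel` with a real API call keeps the structure unchanged: the assistant message with `tool_calls` and every tool message must both be in the history sent with the second request.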

What is the MINIMUM number of LLM API requests required for a complete tool calling cycle (from user question to final answer)?

Parallel Tool Calls: Multiple Functions in One Request

"Compare the weather in Moscow and London" is a reasonable request. The naive implementation calls get_weather twice sequentially, with two round-trips to the LLM. But the model can do better: it returns **both tool_calls in a single response** - parallel tool calls. This feature shipped with GPT-4 Turbo in November 2023 and cut average latency by 35-40% for multi-tool requests.

The `parallel_tool_calls` API parameter controls behavior: `true` (default) - the model can return multiple tool_calls, `false` - strictly one at a time. Disabling is useful when tools have dependencies: the result of the first is needed for the second.

| Scenario | parallel_tool_calls | Why |
| --- | --- | --- |
| Weather in 3 cities simultaneously | true | Requests are independent - can execute in parallel |
| Find user → get their orders | false | The second call depends on the result of the first |
| Search products + check balance | true | Independent requests to different systems |
| Create order → send notification | false | Notification is needed only after successful order creation |
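For independent calls, the backend can execute all returned tool_calls concurrently with `Promise.all` and answer each one under its own `tool_call_id`. The sketch below assumes two fake async tools with artificial 50 ms delays; the tool names and return values are illustrative only.

```typescript
type ToolCall = { id: string; name: string; arguments: string };
type ToolMessage = { role: "tool"; tool_call_id: string; content: string };
type AsyncTool = (args: Record<string, unknown>) => Promise<string>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical tools that simulate network latency.
const tools = new Map<string, AsyncTool>([
  ["get_weather", async (args) => {
    await sleep(50);
    return JSON.stringify({ city: args.city, temp_c: 3 });
  }],
  ["search_products", async (args) => {
    await sleep(50);
    return JSON.stringify({ query: args.query, price_usd: 19.9 });
  }],
]);

// Run one call and always resolve: a failure becomes an error payload,
// so one bad call cannot reject the whole Promise.all batch.
async function executeOne(call: ToolCall): Promise<ToolMessage> {
  let content: string;
  try {
    const tool = tools.get(call.name);
    if (!tool) throw new Error(`Unknown tool: ${call.name}`);
    content = await tool(JSON.parse(call.arguments));
  } catch (err) {
    content = JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
  }
  return { role: "tool", tool_call_id: call.id, content };
}

// All calls start immediately; results come back in the original order.
async function executeAll(calls: ToolCall[]): Promise<ToolMessage[]> {
  return Promise.all(calls.map(executeOne));
}
```

With two 50 ms tools the batch takes roughly 50 ms instead of 100 ms, since the waits overlap. `Promise.all` preserves input order, but the `tool_call_id` is what ties each result to its call.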

The user asks: "What's the weather in Moscow and how much does an umbrella cost in the catalog?" The model returned 2 tool_calls: get_weather and search_products. How should the backend handle them?

Error Handling and Argument Validation

Phantom function is a real class of production failure. GPT-4 called `delete_user_account` instead of `get_user_account`: both names end with `user_account`, and the model's description-based routing broke. In the 2 hours before detection, 37 accounts were deleted. Tool calling is only as safe as the code around it: the model generates JSON, and the backend decides whether to execute.

Protection Against Dangerous Calls

| Protection | What it prevents | Implementation |
| --- | --- | --- |
| Function name validation | Calling a nonexistent or dangerous function | Whitelist of allowed names |
| Zod argument validation | SQL injection, invalid data | Typed schemas with constraints |
| Execution timeout | Hanging on a slow API | `Promise.race` with timeout |
| Per-tool rate limiting | DDoS via the bot ("call search 1000 times") | Call counter per tool per session |
| Confirmation for dangerous tools | Accidental deletion/modification of data | Pause before `create_return`, `delete_account` |
| Iteration limit | Infinite tool call loop | `MAX_TOOL_ITERATIONS = 5` |
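Three of the protections from the table (whitelist, timeout via `Promise.race`, per-tool rate limit) can be combined in one small guard. This is a sketch under stated assumptions: the `ToolGuard` class, the allowed names, and the limits of 3 calls and 100 ms are invented for illustration.

```typescript
// Hypothetical configuration for this sketch.
const ALLOWED_TOOLS = new Set(["get_order", "search_products"]);
const MAX_CALLS_PER_TOOL = 3; // per session
const TIMEOUT_MS = 100;

// Reject if the work does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// One guard instance per user session.
class ToolGuard {
  private counts = new Map<string, number>();

  async run(name: string, exec: () => Promise<string>): Promise<string> {
    // Whitelist: unknown or dangerous names never execute.
    if (!ALLOWED_TOOLS.has(name)) {
      return JSON.stringify({ error: `Tool not allowed: ${name}` });
    }
    // Rate limit: count calls per tool within the session.
    const used = this.counts.get(name) ?? 0;
    if (used >= MAX_CALLS_PER_TOOL) {
      return JSON.stringify({ error: `Rate limit reached for ${name}` });
    }
    this.counts.set(name, used + 1);
    // Timeout: a slow tool becomes an error payload, not a hung request.
    try {
      return await withTimeout(exec(), TIMEOUT_MS);
    } catch (err) {
      return JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
    }
  }
}
```

Note that `Promise.race` stops waiting but does not cancel the underlying work; a real implementation would likely also pass an `AbortSignal` to the slow call.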

When an error is returned in the tool result, the model usually tries to fix the call. For example, given an invalid order_id, the model will ask the user to clarify the number. This is **desired behavior**: do not throw exceptions; return an error description in the `content` field of the tool result.
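A minimal sketch of that error loop follows. The hand-written `validateOrderId` check stands in for a schema library such as Zod, the `ORD-123` id format is an assumption, and the `model` parameter is a placeholder for the real LLM call.

```typescript
const MAX_TOOL_ITERATIONS = 5;

type ToolCall = { id: string; name: string; arguments: string };
type ModelTurn = { text: string | null; toolCalls: ToolCall[] };
type ToolMessage = { role: "tool"; tool_call_id: string; content: string };
type Check = { ok: true; orderId: string } | { ok: false; error: string };

// Hand-written validation standing in for a schema library such as Zod.
function validateOrderId(raw: string): Check {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: "Arguments are not valid JSON" };
  }
  const orderId = (parsed as { order_id?: unknown } | null)?.order_id;
  if (typeof orderId !== "string" || !/^ORD-\d+$/.test(orderId)) {
    return { ok: false, error: 'order_id must look like "ORD-123"' };
  }
  return { ok: true, orderId };
}

// Hypothetical tool: on bad input the error goes into `content`, not up the stack.
function createReturn(call: ToolCall): ToolMessage {
  const check = validateOrderId(call.arguments);
  const content = check.ok
    ? JSON.stringify({ status: "return_created", order_id: check.orderId })
    : JSON.stringify({ error: check.error });
  return { role: "tool", tool_call_id: call.id, content };
}

// Bounded loop: the model sees each tool result and may retry,
// but never more than MAX_TOOL_ITERATIONS times.
async function runLoop(model: (history: ToolMessage[]) => Promise<ModelTurn>): Promise<string> {
  const history: ToolMessage[] = [];
  for (let i = 0; i < MAX_TOOL_ITERATIONS; i++) {
    const turn = await model(history);
    if (turn.toolCalls.length === 0) return turn.text ?? "";
    for (const call of turn.toolCalls) history.push(createReturn(call));
  }
  return "Stopped: tool iteration limit reached";
}
```

Because the error text is something the model reads, it works best when it says what a valid value looks like, which gives the model enough information to retry or to ask the user.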

Misconception: "If a tool is described in a schema, the LLM will call it correctly. Validation and error handling are the client code's job, not the model's."

Reality: LLMs routinely produce invalid arguments, confuse similar functions, and skip required fields. Production systems must validate every tool call before execution and return structured errors in a loop so the model can self-correct.

Why the misconception arises: a schema is read as a contract the LLM is obliged to honour, an intuition imported from strictly typed APIs. In reality the LLM emits JSON as text, and phantom functions, missing required fields and type mismatches occur in 5-15% of calls even on GPT-4. Without an error loop those failures turn into production incidents.

The LLM called create_return with an invalid order_id. How should the backend respond?

Key Takeaways

  • Tool calling is the boundary between LLM-as-text and LLM-as-agent: the model generates a JSON request, the backend executes
  • Tool description is a prompt: one example sentence raises accuracy from 73% to 91%
  • The cycle requires at least 2 requests: the first returns the tool_call, the second (with the result) returns the final answer
  • Parallel tool calls executed via Promise.all: 35-40% lower latency for independent calls
  • Production stack: Zod validation + whitelist + timeouts + rate limits + MAX_TOOL_ITERATIONS = 5

Questions for Reflection

  • Why should the backend decide whether to execute a tool call? What would change if the model could execute functions directly?
  • A tool description is a prompt for the model. How does that change the approach to writing descriptions compared to regular documentation?
  • Parallel tool calls reduce latency. But in which scenarios is call order critical - and how does that affect architecture?

What's Next

Tool calling is a single function call. But what if a task requires a chain of calls with reasoning in between? That is an AI agent - a system that plans, reasons, and acts in a loop.

  • Agent Fundamentals — Tool calling in a reasoning loop - ReAct pattern, planning, memory for agents
  • Agent Frameworks — LangGraph, CrewAI, AutoGen - frameworks that abstract tool calling into agent systems
  • Error Handling for LLMs — Deeper into error handling: retries, fallbacks, graceful degradation for LLM systems

Related Lessons

  • aie-07-structured-output — Tool call is structured output to JSON schema
  • aie-17-agent-fundamentals — Tool calling is the atomic primitive of agents
  • aie-15-conversation-memory — Agent with memory needs both memory + tools
  • aie-19-multi-agent — Multi-agent: agents call each other as tools
  • bt-09-grpc — Tool schema is IDL like gRPC Protobuf
  • st-01-feedback-loops — ReAct observe-think-act is a classic feedback loop
  • net-64-api-gateway