Tool Use: How Agents Take Action

Tool use is how an agent moves beyond pure text generation and interacts with external systems. Reliable tool use depends on more than choosing a function name. It depends on arguments, execution control, permissions, and verification.

Tool use is how an agent turns reasoning into action.

That is the short answer.

If you want the more operational version, use this:

Tool use is the part of an agent system that lets the model request capabilities outside itself, and lets the runtime execute those requests against real data, software, and environments.

Without tools, an agent can still explain, summarize, plan, or simulate.

But it cannot reliably:

  - inspect fresh system state
  - retrieve protected data
  - take external actions

That is why tool use is one of the core components of agent engineering.

It is the boundary where the system moves from language generation into controlled interaction with the world.

Why Agents Need Tools

Planning alone does not let a system act.

As Planning and Task Decomposition explains, an agent can break a goal into sensible next steps. But a plan remains theory until the system can inspect real state or cause a real change.

An agent can produce a perfect plan in text and still be useless if it has no way to inspect live state or carry out the next step.

Suppose an agent decides:

  1. check the status of the failed billing job
  2. inspect logs
  3. look for a recent config change
  4. retry the job if the dependency recovered

That is a reasonable plan.

But the plan only matters if the system can actually:

  - query the status of the billing job
  - read the relevant logs
  - inspect recent config changes
  - trigger a retry

This is the difference between talking about work and doing work.

So tool use is not an optional extra attached to agent systems.

It is how the act step becomes real.

What Counts as a Tool

In agent systems, a tool is any external capability the runtime makes available to the model.

That capability might be:

  - an API call into another service
  - a database or log query
  - a file or shell operation
  - a search over documents or the web

The important point is not whether the implementation is called a function, tool, action, command, skill, or capability.

The important point is that the model cannot perform that operation by itself. It has to request it through a defined interface.

That interface tells the model:

  - what the capability does
  - what inputs it expects
  - what kind of result it returns
  - when it is appropriate to use

So tool use is not magic.

It is an interface contract between the model and the application runtime.
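
As a concrete sketch of that contract, here is what a definition for the retry_job tool used later in this article might look like, written in the JSON-Schema style most function-calling APIs use. The exact field names are illustrative assumptions, not any vendor's spec.

    # A hypothetical tool definition in the JSON-Schema style.
    # Field names and descriptions are illustrative only.
    retry_job_tool = {
        "name": "retry_job",
        "description": "Retry a failed scheduled job after its blocking "
                       "dependency has recovered.",
        "parameters": {
            "type": "object",
            "properties": {
                "job_id": {
                    "type": "string",
                    "description": "Identifier of the failed job.",
                },
                "reason": {
                    "type": "string",
                    "description": "Why the retry is expected to succeed now.",
                },
            },
            "required": ["job_id", "reason"],
        },
    }

Everything the model needs to reason about the capability lives in that one structure: what it does, what inputs it takes, and which of them are required.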

What Actually Happens During a Tool Call

Many people imagine tool use as one event:

the model names a function, and the function runs.

In practice, it is a small protocol.

The usual flow looks like this:

  1. the application provides the model with a set of tools and their schemas
  2. the model decides whether one of those tools is needed
  3. the model emits a tool call with arguments
  4. the application runtime validates and executes the call
  5. the tool result is returned to the model as new context
  6. the model continues reasoning, asks for another tool, or produces a final answer
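
Here is a minimal sketch of that flow in Python. The call_model function is a stand-in for whatever model client you use, and the tool registry is hand-rolled; real SDKs differ in message shapes, but the control flow is the same.

    import json

    # Hypothetical registry mapping tool names to real functions.
    TOOL_REGISTRY = {
        "get_job_status": lambda job_id: {"job_id": job_id, "state": "failed"},
    }

    def run_agent(user_message, call_model, tools, max_steps=10):
        """Drive the model/tool loop until the model gives a final answer.

        call_model is assumed to return either a tool-call request or a
        final text answer.
        """
        messages = [{"role": "user", "content": user_message}]
        for _ in range(max_steps):
            response = call_model(messages=messages, tools=tools)  # steps 1-3
            if response["type"] != "tool_call":
                return response["content"]                         # final answer
            name, args = response["name"], response["arguments"]
            if name not in TOOL_REGISTRY:                          # step 4: validate
                result = {"error": f"unknown tool: {name}"}
            else:
                result = TOOL_REGISTRY[name](**args)               # step 4: execute
            messages.append(                                       # step 5: feed back
                {"role": "tool", "name": name, "content": json.dumps(result)}
            )                                                      # step 6: loop again
        raise RuntimeError("agent exceeded max_steps without finishing")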

That matters because the model is not directly reaching into your systems.

Your application is still the thing doing the real work.

That is the trust boundary:

the model can propose an action, but the runtime must decide whether that action is valid, allowed, executable, and complete.

The runtime decides:

  - whether the call is valid
  - whether it is allowed
  - whether it should execute now
  - how to judge whether it completed

This is why tool use belongs to both model design and systems design.

A Running Example: Resolving a Failed Nightly Job

Suppose a user asks an operations agent:

Figure out why the nightly billing job failed and take the safest next action.

The agent may need tools such as:

  - a tool that checks the status of a scheduled job
  - a tool that fetches recent logs
  - a tool that lists recent config changes
  - a retry_job tool that reruns the failed job

Now look at what has to go right.

The system does not only need the model to say use retry_job.

It also needs:

  - the model to form correct arguments: the right job, the right run
  - the runtime to confirm the action is allowed before executing it
  - a way to verify that the retry actually worked

If the model chooses the correct tool but forms the wrong arguments, or if the runtime executes the action without checking whether it is allowed, the system still fails.

That is the key lesson.

Tool selection is only the beginning.

The Hard Part Is Not Just Choosing a Tool

When teams first add tool use, they often focus on the visible moment where the model names a function.

That is too narrow.

A large share of real failures happen after the broadly correct tool has already been chosen.

For example:

  - the arguments name the wrong job or omit a required value
  - the call executes without the permission checks it needed
  - an ambiguous result gets misread as success
  - nothing verifies that the intended outcome actually happened

This is why tool reliability is not mainly a model-intelligence question.

It is an interface-quality question.

Good tool use depends on:

  - clear tool definitions
  - correct argument formation
  - validated, bounded execution
  - explicit permissions
  - verification of results

The Action Contract

The simplest way to evaluate a tool step is to treat it as a contract.

Use this test before trusting an agent action.

1. What Capability Is Being Requested?

The tool should describe one clear capability.

If the tool definition is vague, overlapping, or does too many things, the model has to guess when to use it.

2. What Exact Inputs Does Execution Require?

The system should know which inputs are required, which are optional, and which values should never be guessed.

If the runtime already knows a value, it should usually inject it itself instead of making the model reconstruct it.

3. What Limits or Approvals Apply Before Execution?

Not every tool call should execute just because the model emitted it.

Some actions need:

  - human approval before execution
  - permission or scope checks
  - rate limits or blast-radius limits

4. What Signals Prove the Action Worked or Failed?

The system should know how success is measured.

A tool call returning without an exception is not always proof that the intended outcome happened.
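
As an illustration, here is one way the runtime could verify the retry_job action from the running example. The get_job_status function is a hypothetical stand-in for a scheduler API that returns a dict with a state field.

    import time

    def verify_retry(job_id, get_job_status, timeout_s=300, poll_s=15):
        """Confirm a retried job reached a terminal success state."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            state = get_job_status(job_id)["state"]
            if state == "succeeded":
                return True          # the outcome happened, not just the call
            if state == "failed":
                return False         # the retry ran and failed again
            time.sleep(poll_s)       # still queued or running: keep waiting
        return False                 # timed out: treat the outcome as unverified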

That is the Action Contract.

A tool step is much stronger when the capability, inputs, limits, and verification signals are all explicit.

Apply that to retry_job in the nightly-job example:

  - capability: rerun the failed nightly billing job
  - inputs: which job and which run, never guessed
  - limits: whether a production retry needs approval first
  - verification: the job's status after the retry, not just the API response

If one of those four pieces is unclear, the call is weaker than it looks.

Why Argument Formation Matters So Much

A tool call is not useful if the arguments are weak.

This is one of the easiest places to underestimate the engineering work.

Go back to the nightly-job example.

A poor retry_job interface might ask the model to provide:

  - the job identifier
  - the environment
  - the exact run that failed
  - retry parameters
  - notification settings

all in one call, even though the application may already know some of those values or be able to derive them deterministically.

That design forces the model to synthesize too much.

A stronger system often moves some of that burden out of the model and into code.

For example:

  - the runtime injects the job identifier from the incident context
  - the environment is derived from configuration, not from model output
  - the model supplies only the parts that require judgment, such as why the retry should succeed now

This is a general rule.

Do not make the model guess inputs that the system already knows or can derive more reliably.
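
A sketch of that rule in code, assuming a hypothetical runtime object that holds what the application already knows:

    def execute_retry_job(model_args, runtime):
        """Merge model-supplied arguments with runtime-known values."""
        args = {
            "job_id": runtime.incident.job_id,       # known: injected, never guessed
            "environment": runtime.environment,      # known: derived from config
            "run_date": runtime.incident.run_date,   # known: from the failure record
            "reason": model_args["reason"],          # judgment: supplied by the model
        }
        return runtime.scheduler.retry(**args)

The model-facing schema then only needs to expose the reason field, because everything else is resolved deterministically in code.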

Validation, Permissions, and Bounded Execution

This is where tool use stops being a product demo and becomes engineering.

If the model emits a call, the runtime should still be able to say:

  - no
  - not now
  - not with these arguments
  - not without approval

That means safe tool use usually needs more than a tool schema.

It needs execution boundaries.

Common boundaries include:

  - allowlisting which tools are available for a given task
  - validating arguments against the tool schema before execution
  - scoping permissions to the minimum the task requires
  - rate limits and budgets on expensive or repeated calls
  - approval gates for irreversible or high-impact actions
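
A minimal sketch of such a gate, assuming hypothetical validation and approval hooks. The point is only that these checks run in code, after the model has emitted the call:

    IRREVERSIBLE_TOOLS = {"retry_job", "issue_refund"}  # illustrative policy

    def gate_tool_call(name, args, allowed_tools, validate_args, request_approval):
        """Decide whether an emitted tool call may execute.

        validate_args and request_approval are hypothetical hooks: schema
        validation and a human-approval channel, respectively.
        """
        if name not in allowed_tools:
            return False, f"tool '{name}' is not allowed for this task"
        ok, why = validate_args(name, args)
        if not ok:
            return False, f"invalid arguments: {why}"
        if name in IRREVERSIBLE_TOOLS and not request_approval(name, args):
            return False, "approval denied"
        return True, "approved"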

This does not make the system less agentic.

It makes the system governable.

Tool Results Are New Observations, Not Final Truth

A tool result should feed the loop back into sensing and reasoning.

That means the agent should treat the result as new state to interpret.

For example:

  - a retry_job call that returns queued does not mean the job succeeded
  - a log query that returns nothing may mean no errors, or the wrong time window
  - a status check may already be stale by the time the next step runs

So a good agent does not think:

I called the tool, therefore the task is done.

It thinks:

I have new evidence. What does this change about the next step?
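
That habit can even be made mechanical. A toy sketch, assuming the same result shape as the verification example earlier:

    def next_step_after_retry(result):
        """Map a raw tool result to the agent's next move."""
        state = result.get("state")
        if state == "succeeded":
            return "report success and record what changed"
        if state in ("queued", "running"):
            return "wait, then re-check status before claiming completion"
        if state == "failed":
            return "stop retrying and escalate with the collected evidence"
        return "treat the result as ambiguous and gather more state"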

This is one reason tool use connects directly to The Sense-Think-Act Loop. Tool results become new sensed state. They also connect to memory, because the system may need to preserve what was tried, what changed, and what still remains unresolved.

Tool Use Is Not the Same as a Workflow Step

This matters because both involve actions, but they solve different problems.

A workflow step is usually predefined by the designer. That is the same core distinction made in LLMs, Workflows, and Agents: What Actually Changes? Action capability is not the same thing as control structure.

Tool use is the action capability available to the runtime.

A workflow might say:

  1. check account status
  2. check refund eligibility
  3. ask for approval if needed
  4. issue refund

Inside that workflow, the actual operations may still happen through tools.
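
To make the contrast concrete, here is that workflow sketched as code. Every method on the tools object is a hypothetical stand-in for a tool invocation; the control structure is fixed by the designer, and only the individual operations go through tools.

    def refund_workflow(account_id, amount, tools, approver):
        """A designer-defined workflow whose steps execute through tools."""
        if tools.check_account_status(account_id) != "active":       # step 1
            return "account not active"
        if not tools.check_refund_eligibility(account_id, amount):   # step 2
            return "not eligible"
        if amount > 100 and not approver.approve(account_id, amount):
            return "approval denied"            # step 3 (threshold illustrative)
        return tools.issue_refund(account_id, amount)                # step 4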

In a more agentic system, those decisions move to runtime. The model may decide:

  - which tool to call next
  - in what order
  - when the evidence is strong enough to stop

So tool use is part of the action layer.

Workflows and plans decide how that layer is used.

What Strong Tool Use Looks Like

A well-engineered tool layer usually has these properties:

The Tool Set Is Small Enough to Be Legible

Too many overlapping tools make selection worse.

Each Tool Has a Clear Job

Tools should not hide several unrelated capabilities behind one vague name.

Inputs Are Structured Around Real Needs

The schema should match what execution actually requires.

The Runtime Enforces Limits

Policies and approvals should live in code, not only in prompts.

Results Are Returned in a Way the Agent Can Reason About

The agent needs outputs that make success, failure, and ambiguity visible.
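
One common convention, sketched here as an assumption rather than a standard, is to separate "the call completed" from "the outcome is verified" in the result shape itself:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ToolResult:
        """An illustrative result shape that keeps success, failure, and
        ambiguity visible to the agent."""
        ok: bool                     # did the call itself complete?
        verified: Optional[bool]     # did the intended outcome demonstrably happen?
        data: dict = field(default_factory=dict)  # raw payload to interpret
        note: str = ""               # caveat, e.g. "status still queued"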

Every Action Has a Verification Path

If the system cannot check whether the action worked, it has not really closed the loop.

Why This Matters for the Next Articles

Once tool use is in place, several later topics become easier to understand.

Memory matters because the agent needs to remember prior actions, results, and unresolved state.

ReAct matters because tool use is one of the clearest cases of a think-act-observe loop.

Guardrails matter because the runtime needs bounded execution, typed outputs, and permission control.

Evaluation matters because a final answer can look good even when the trajectory used the wrong tools, passed weak arguments, or failed to verify success.

So tool use is not just another feature on the list.

It is the point where agent design starts touching reality.

FAQ

Can an agent still be useful without tools?

Yes, but only within narrower limits. A tool-less agent can still explain, brainstorm, summarize, or help structure work. What it cannot do reliably is inspect fresh system state, retrieve protected data, or take external actions.

Is tool use just function calling?

Function calling is one common interface mechanism for tool use, but the larger concept is broader. Tool use includes the whole runtime loop around capability exposure, argument formation, execution, result handling, and continued reasoning.

If every action goes through predefined tools, is the system still an agent?

Yes. Agents do not stop being agents because their action space is bounded. In practice, bounded action spaces are often what make agent behavior usable, auditable, and safe.

Why do tool calls fail even when the model picked the right tool?

Because the failure often happens at the next layer: missing arguments, invalid values, ambiguous results, unsafe execution, missing permissions, or weak post-action checks.

Should the model supply every tool argument?

No. If the runtime already knows a value or can derive it deterministically, it should usually do that itself. The model should be responsible mainly for the parts that actually require interpretation or choice.

How many tools should an agent have at once?

There is no universal number, but smaller and clearer tool sets are generally easier for the model to use well. Too many overlapping tools increase confusion and misfires.

Where should permissions and approvals live?

In the runtime layer, not only in the prompt. Prompts can express intent, but policy enforcement has to happen in code and system controls.

Are tool results always trustworthy?

No. A tool result is new evidence, not guaranteed truth. It may be partial, stale, ambiguous, or operationally incomplete, so the agent still needs to reason about what the result means.

How does tool use connect to memory?

Tool use creates state that may need to persist across steps or sessions: what was queried, what changed, what failed, and what still needs follow-up.

How does tool use connect to evaluation?

A strong evaluation does not only ask whether the final answer looked correct. It asks whether the system chose the right tools, formed good arguments, stayed within bounds, and verified the outcome properly.