Most teams start agent evaluation in the wrong place.
They score the ending.
If the final answer looks correct, they call the run a success.
That works better for single-turn model calls than it does for agent systems.
Agents do not only answer.
They plan, retrieve, call tools, retry, recover, escalate, and cross execution boundaries.
That means a correct final answer can hide a bad run.
The real question is no longer:
Did the agent get the answer right?
It becomes:
Did the agent behave well enough to trust in production?
That is the job of trajectory evaluation.
This article builds on Planning and Task Decomposition, ReAct and the Basic Reasoning Loop, Tool Use: How Agents Take Action, Structured Outputs, Guardrails, and Execution Boundaries, Supervisor, Router, and Planner-Executor Patterns, and Human-in-the-Loop Control Design. Those pieces explain how agent systems think, act, coordinate, and stay bounded. This one explains how to judge whether the run itself was good.
The Trap of Final-Answer Scoring
The simplest failure mode is also the most dangerous:
the system looks correct at the end, so the team never notices that the path was weak.
Imagine an agent that prepares a customer refund.
It eventually returns the correct amount.
But along the way it:
- checked the wrong account first
- made failed tool calls before finding the right one
- used stale context before correcting itself
- proposed an out-of-policy action that a guardrail blocked
- succeeded only after a late retry recovered the run
The ending looks fine.
The trajectory does not.
That run may still be:
- too expensive
- too slow
- too brittle
- too risky
- too dependent on lucky recovery
This is why output-only evaluation breaks for agents.
A correct result does not prove a trustworthy process.
Output Quality vs Process Quality
This distinction has to be explicit.
Output quality asks whether the final answer, action, or completed task was correct.
Process quality asks whether the agent took a sensible, safe, and efficient path to get there.
Those are related.
They are not the same.
Output quality tells you whether the destination was acceptable.
Process quality tells you whether the system should be trusted to travel that route again.
For agent systems, both matter.
If you grade only the destination, you miss:
- silent failures
- hidden brittleness
- repeated retries
- tool misuse
- recoveries that only worked by luck
- policy violations that were blocked downstream instead of avoided upstream
That last case matters more than many teams realize.
A blocked unsafe action is not proof of a good run.
It may be proof that a downstream control rescued a bad run before it became an incident.
What a Trajectory Actually Is
A trajectory is the path the agent takes through the task.
In practice, that means the sequence of:
- reasoning steps
- tool calls
- observations
- retries
- corrections
- handoffs
- pauses
- completion decisions
If ReAct and the Basic Reasoning Loop explains the local thought-action-observation cycle, trajectory evaluation looks at the full run those local cycles create.
If Planning and Task Decomposition explains how the work is broken down, trajectory evaluation asks whether that decomposition actually produced a trustworthy path.
If Supervisor, Router, and Planner-Executor Patterns explains who controls the system, trajectory evaluation asks whether that orchestration produced good behavior.
That is why this topic sits naturally after control design.
Once an agent can act over time, the run becomes part of the product.
What Deserves Evaluation Inside the Run
Not every agent needs the same depth of review.
If the system is basically a single bounded response with no real action surface, output quality may still do most of the work.
Trajectory evaluation matters most when the system:
- acts over multiple steps
- calls tools or external services
- can create side effects
- may recover from intermediate mistakes before producing the final answer
For those systems, four dimensions matter again and again:
| Dimension | What you inspect | What failure looks like |
|---|---|---|
| Sequence quality | Did the run follow a sensible path? | loops, wasted steps, wrong order, avoidable detours |
| Tooling quality | Did it choose and use tools correctly? | wrong tool, bad arguments, misread tool output |
| Recovery quality | Did it handle failure well? | repeated retries, no adaptation, brittle collapse |
| Policy fit | Did it stay inside the allowed operating boundary? | unsafe proposals, permission drift, skipped checkpoints |
There is also one cross-cutting issue that deserves explicit attention:
state consistency.
Did the agent stay aligned with the actual state of the world as the run progressed?
This is where a lot of trajectories quietly degrade.
An agent may:
- keep reasoning from stale context after a tool result changed the situation
- forget that an earlier step already failed
- continue planning from an outdated assumption
- produce a locally coherent answer that is globally inconsistent with the current task state
State consistency does not need to become a fifth headline dimension every time, but it should be checked across the four that already matter.
Weak state consistency usually shows up as:
- poor sequence decisions
- misuse of tool outputs
- bad retries
- false confidence after the environment has already changed
This is the key shift.
You are no longer only grading what the agent said at the end.
You are grading how it behaved while it was acting.
The S.T.E.P. Trajectory Review
A simple way to inspect an agent run is the S.T.E.P. Trajectory Review.
S.T.E.P. stands for:
- Sequence
- Tooling
- Error Recovery
- Policy Fit
This is not a vendor framework.
It is a compact way to inspect the run itself.
Applied to the refund example above, the review would look like this:
- Sequence: did the agent check the customer and refund context in a sensible order, or did it wander through avoidable steps first?
- Tooling: did it call the right account and refund tools with the right arguments, or did it misuse them and recover later by luck?
- Error Recovery: once the first account lookup failed, did it adapt intelligently, or just keep retrying until something worked?
- Policy Fit: even if it found the right amount, did it propose any action that should have required a checkpoint or crossed the allowed operating boundary?
That is the practical point of the framework.
It gives you a way to judge the same run from four different angles instead of collapsing everything into one outcome score.
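To make that concrete, a S.T.E.P. review can be recorded as a small per-run record.
This is a minimal sketch in Python, not a standard schema; the field names and the 0-to-1 scoring scale are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class StepReview:
    """One S.T.E.P. review of a single agent run (0.0-1.0 per dimension, assumed scale)."""
    run_id: str
    sequence: float        # did the run follow a sensible path?
    tooling: float         # were tools chosen and used correctly?
    error_recovery: float  # did the run adapt after failures?
    policy_fit: float      # did the run stay inside the allowed boundary?
    notes: list[str] = field(default_factory=list)

    def passed(self, threshold: float = 0.7) -> bool:
        # Gate on the weakest dimension, so one strong score
        # cannot mask a weak one.
        return min(self.sequence, self.tooling,
                   self.error_recovery, self.policy_fit) >= threshold

review = StepReview(
    run_id="refund-0042",
    sequence=0.9, tooling=0.6, error_recovery=0.8, policy_fit=1.0,
    notes=["first account lookup hit the wrong account"],
)
print(review.passed())  # False: the tooling score drags the run down
```

Gating on the weakest dimension is deliberate: a run that misuses tools or drifts out of policy should not pass on the strength of its other scores.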
Sequence
Did the agent take a sensible route through the task?
Good sequence quality usually means:
- the task was decomposed reasonably
- the order of operations made sense
- the run converged instead of wandering
- extra steps were justified by new evidence rather than confusion
Bad sequence quality looks like:
- circular loops
- unnecessary substeps
- checking things too late
- acting before key information was confirmed
This is often where a run becomes expensive long before it becomes obviously wrong.
Tooling
Did the agent pick the right tool and use it correctly?
Good tooling quality usually means:
- the tool choice matched the need
- the arguments were correct
- the result was interpreted correctly
- the agent did not keep reaching for tools it did not need
Bad tooling quality looks like:
- wrong tool selection
- invented parameters
- repeated failed calls
- correct tool output followed by incorrect reasoning about what it means
This is where Tool Use: How Agents Take Action becomes evaluative instead of architectural.
Error Recovery
What happened after the run went off track?
Good recovery behavior usually means:
- the agent noticed the failure
- changed strategy instead of repeating itself
- rechecked the state of the world
- stopped when escalation or review was justified
Bad recovery behavior looks like:
- blind retries
- repeating the same failed action
- compounding an early mistake
- reaching the correct answer only by luck after several bad steps
Two runs may both succeed.
One may succeed cleanly.
The other may succeed after three fragile recoveries that will fail under slightly different conditions.
Those are not equally good runs.
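Part of that difference is mechanically checkable.
The sketch below flags blind retries: the same failed tool call repeated with identical arguments.
The event shape is an assumption for illustration, not a standard trace format.

```python
def count_blind_retries(events: list[dict]) -> int:
    """Count times a failed tool call was immediately repeated with the
    same tool and identical arguments -- retrying without adapting.
    Assumed event shape: {"tool": str, "args": dict, "ok": bool}."""
    blind = 0
    last_failure = None  # (tool, frozen args) of the most recent failed call
    for event in events:
        call = (event["tool"], tuple(sorted(event["args"].items())))
        if not event["ok"]:
            if call == last_failure:
                blind += 1  # same failing call, same arguments, tried again
            last_failure = call
        else:
            last_failure = None  # a success ends the retry chain
    return blind

events = [
    {"tool": "lookup_account", "args": {"id": "A1"}, "ok": False},
    {"tool": "lookup_account", "args": {"id": "A1"}, "ok": False},  # blind retry
    {"tool": "lookup_account", "args": {"id": "A2"}, "ok": True},   # adapted
]
print(count_blind_retries(events))  # 1
```

A count like this cannot tell you whether the recovery was intelligent, but it cheaply separates runs that adapted from runs that hammered the same failure.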
Policy Fit
Did the run stay inside the allowed operating boundary?
This matters even when the final answer is right.
A run can be factually correct and still unacceptable if it:
- proposed a forbidden action
- crossed a permission boundary
- used an untrusted source
- skipped a required approval
- ignored a required review or escalation point
This is where trajectory evaluation connects directly back to Structured Outputs, Guardrails, and Execution Boundaries and Human-in-the-Loop Control Design.
The run must not only work.
It must work acceptably.
A Good Answer Can Still Come From a Bad Trajectory
This is the most important distinction in the article.
The final answer and the trajectory are related, but they are not interchangeable.
The simplest version is:
- a good answer from a good trajectory is the goal
- a bad answer from a bad trajectory is easy to reject
- a bad answer from a good trajectory is often fixable
- a good answer from a bad trajectory is dangerous because it creates false confidence
That last case is the one teams underestimate.
They think:
The system worked.
What actually happened may be:
The system survived this run.
Those are not the same thing.
Trajectory Evaluation Is Not the Same as Tracing
This distinction should stay clean because the next topic depends on it.
Trajectory evaluation asks:
What should be judged about the run?
Tracing and observability ask:
What evidence did we capture about the run, and can we inspect it later?
Tracing gives you the visibility layer:
- tool-call history
- timestamps
- state transitions
- intermediate outputs
- retry chains
- handoff visibility
Trajectory evaluation applies judgment to that evidence:
- was the sequence sound?
- was the tool use precise?
- was the recovery behavior acceptable?
- did the run stay inside policy?
So tracing is not the judgment.
It is the evidence layer that makes judgment possible.
But that still leaves an important practical question:
how do you actually score a trajectory once you have the trace?
There are three common approaches.
Reference Trajectories, Judge-Scored Trajectories, and Assertions
Once a team accepts that the run itself must be judged, the next question is:
what does the run get judged against?
In practice, most trajectory evaluation falls into three buckets.
1. Reference trajectories
This is the closest thing to answer-key evaluation for agent runs.
You define an expected path, or an expected set of acceptable steps, and compare the actual run against it.
That can be useful when:
- the workflow is relatively stable
- the correct order of operations is known
- certain steps are mandatory
- the domain is compliance-heavy or operationally strict
Examples:
- the agent must verify identity before changing account settings
- the agent must retrieve the latest record before drafting a response
- the agent must call the approval service before executing a sensitive action
The strength of reference trajectories is clarity.
They make it obvious when the agent skipped a required step, did things in the wrong order, or touched the wrong part of the workflow.
The weakness is rigidity.
Many good agent runs are not identical.
If the environment is dynamic, there may be several acceptable routes to the same good outcome.
So reference matching works best when the path is constrained enough that you genuinely care about adherence, not just eventual success.
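When the path is that constrained, the adherence check itself can be simple.
The sketch below tests that the required steps appear in order while tolerating extra benign steps in between; the step names are illustrative.

```python
def follows_reference(actual_steps: list[str], required_in_order: list[str]) -> bool:
    """True if every required step appears in the run, in the required order.
    Extra steps in between are tolerated; skipped or reordered
    required steps are not."""
    remaining = iter(actual_steps)
    return all(step in remaining for step in required_in_order)

# A refund workflow where identity must be verified before money moves.
required = ["verify_identity", "fetch_refund_policy", "issue_refund"]

good_run = ["verify_identity", "lookup_account", "fetch_refund_policy", "issue_refund"]
bad_run  = ["lookup_account", "issue_refund", "verify_identity"]

print(follows_reference(good_run, required))  # True: order held, extras allowed
print(follows_reference(bad_run, required))   # False: refund issued before identity check
```

Because the `in` test consumes the iterator, each required step must appear after the previous one was found, which is exactly the ordered-subsequence property reference matching needs.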
2. Judge-scored trajectories
Sometimes there is no single perfect route.
The run still needs to be judged, but the judgment has to be more flexible.
That is where judge-based evaluation comes in.
A human reviewer or an LLM-as-a-judge looks at the run and scores things like:
- whether the sequence made sense
- whether the tool choice was appropriate
- whether the recovery behavior was intelligent
- whether the run stayed grounded and in policy
This is often the right approach when:
- multiple paths could be valid
- the work is partly subjective
- the task mixes reasoning, tool use, and judgment
- the team is still learning what good and bad runs look like
The strength of judge-scored trajectories is flexibility.
The weakness is that the rubric matters enormously.
If the scoring prompt is vague, the evaluation becomes vague.
If the judge is not calibrated to human standards, the scores become theater.
So judge-based evaluation is useful, but only when the team has a clear rubric for what should count as:
- effective progress
- neutral exploration
- harmful or invalid behavior
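As an illustration, those three bands can be written directly into the judge prompt.
The sketch below assumes a generic `call_llm` helper that takes a prompt string and returns the model's text; that helper is a placeholder, not a specific vendor API.

```python
import json

JUDGE_RUBRIC = """You are reviewing one step of an agent run.

Given the task, the state before the step, and the step itself, classify it:
- "effective": the step made real progress toward the task
- "neutral": reasonable exploration, neither progress nor harm
- "harmful": wasted work, tool misuse, or an out-of-policy action

Reply with JSON only: {"label": "...", "reason": "..."}"""

def judge_step(call_llm, task: str, state: str, step: str) -> dict:
    """Score one trajectory step against the rubric.
    `call_llm` is assumed to be prompt-in, text-out."""
    prompt = f"{JUDGE_RUBRIC}\n\nTask: {task}\nState before step: {state}\nStep: {step}"
    return json.loads(call_llm(prompt))
```

The rubric is the whole point here: the tighter the bands, the less the judge drifts, and periodic human spot checks keep the scores calibrated rather than theatrical.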
3. Programmatic assertions
Some parts of trajectory quality do not need subjective judgment at all.
They can be checked deterministically.
Examples:
- the agent used an allowed tool
- the tool arguments matched the schema
- the approval step happened before execution
- the retry count stayed within budget
- the final output included a required field
This layer matters because it keeps evaluation from becoming entirely opinion-driven.
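Checks like the five above can be plain functions over the trace, as in the sketch below.
The event shape and field names are assumptions for illustration.

```python
def assert_trajectory(events: list[dict],
                      allowed_tools: set[str],
                      retry_budget: int) -> list[str]:
    """Return the list of violated assertions; empty means all passed.
    Assumed event shape:
    {"type": "tool_call" | "approval" | "execute", "tool": str, "ok": bool}."""
    violations = []
    tool_calls = [e for e in events if e["type"] == "tool_call"]

    # Only allowed tools were used.
    for call in tool_calls:
        if call["tool"] not in allowed_tools:
            violations.append(f"disallowed tool: {call['tool']}")

    # Failed-call count stayed within the retry budget.
    failures = sum(1 for call in tool_calls if not call["ok"])
    if failures > retry_budget:
        violations.append(f"{failures} failed calls exceed the budget of {retry_budget}")

    # Approval, if execution happened, preceded it.
    types = [e["type"] for e in events]
    if "execute" in types:
        if "approval" not in types or types.index("approval") > types.index("execute"):
            violations.append("execution happened without prior approval")

    return violations
```

Assertions like these are cheap to run on every trace, which is why they belong underneath the more expensive judge layer.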
Strong agent evaluation usually combines all three:
- reference checks where the path is known
- judge-based scoring where the path is flexible
- assertions where the system should be objectively testable
Milestones and Partial Credit Matter More Than Teams Expect
One weakness in naive trajectory evaluation is that it treats the run as either:
- a total success
- or a total failure
That is often too crude.
Many real tasks are multi-stage.
An agent may:
- gather the right information
- choose the right tools
- make good progress through the task
- and still fail in the final step
If you score only pass or fail, you lose the ability to see whether the system is improving.
That is why milestone-based evaluation matters.
Instead of judging only the ending, you define meaningful checkpoints inside the run.
For example, in a research workflow the milestones might be:
- Identify the right sources
- Extract the relevant facts
- Resolve conflicts or ambiguities
- Produce the final synthesis
In a customer-support workflow they might be:
- Retrieve the right account
- Verify policy constraints
- Prepare the right action
- Route to approval if needed
- Complete or draft the resolution
This does two useful things.
First, it tells you whether the agent is failing early or late.
Second, it gives partial credit for meaningful progress instead of flattening every failure into the same undifferentiated bucket.
That matters operationally.
An agent that consistently reaches the right milestones and fails at the final step is a different engineering problem from an agent that goes off track immediately.
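A minimal sketch of milestone scoring, assuming each milestone is a named boolean check over the run's trace (the checks and field names below are illustrative):

```python
from typing import Callable

def milestone_score(trace: dict,
                    milestones: list[tuple[str, Callable[[dict], bool]]]) -> dict:
    """Score a run by ordered milestones, stopping at the first miss.
    Partial credit plus the failure point separates 'failed at the
    final step' from 'went off track immediately'."""
    reached: list[str] = []
    for name, check in milestones:
        if not check(trace):
            return {"reached": reached, "failed_at": name,
                    "credit": len(reached) / len(milestones)}
        reached.append(name)
    return {"reached": reached, "failed_at": None, "credit": 1.0}

# Illustrative checks for the support workflow above.
support_milestones = [
    ("retrieved_right_account", lambda t: t.get("account_id") == t.get("expected_account_id")),
    ("verified_policy",         lambda t: "policy_check" in t.get("steps", [])),
    ("prepared_right_action",   lambda t: t.get("draft_action") is not None),
    ("routed_or_completed",     lambda t: t.get("status") in {"pending_approval", "resolved"}),
]
```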
Milestones also work well with the S.T.E.P. lens.
At each checkpoint, you can ask:
- was the sequence still sensible?
- was the tool use still precise?
- was the recovery behavior appropriate?
- did the run remain in policy?
So milestone scoring is not a replacement for trajectory evaluation.
It is one of the clearest ways to make trajectory evaluation more informative.
What Data You Actually Need to Evaluate a Trajectory
One reason teams stay vague on trajectory evaluation is that they never define the minimum evidence needed to do it.
You do not need a perfect observability platform before you can start.
But you do need more than the final answer.
At minimum, a usable trajectory record should preserve:
- the user request or task objective
- the intermediate steps or decisions
- tool calls and tool outputs
- retry attempts
- the evolving task state or relevant retrieved context
- the final result
- any approvals, guardrail triggers, or escalation events
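That minimum can be pinned down as a schema so every run is captured the same way.
This is one possible shape, sketched with Python type hints; the field names are illustrative, not a standard format.

```python
from typing import Any, TypedDict

class ToolCall(TypedDict):
    tool: str               # which tool was called
    args: dict[str, Any]    # the arguments it was called with
    output: Any             # what the tool returned
    ok: bool                # whether the call succeeded

class TrajectoryRecord(TypedDict):
    objective: str                # the user request or task objective
    steps: list[str]              # intermediate steps or decisions, in order
    tool_calls: list[ToolCall]    # calls, outputs, and success flags
    retries: int                  # retry attempts across the run
    state_snapshots: list[dict]   # evolving task state or retrieved context
    final_result: Any             # what the run ultimately produced
    control_events: list[str]     # approvals, guardrail triggers, escalations
```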
Without that structure, you cannot really tell:
- whether the path was coherent
- whether the wrong tool was chosen
- whether the retry logic made sense
- whether the agent drifted from the current state of the world
- whether the system stayed inside policy
This is where instrumentation and evaluation meet.
Tracing gives you the raw material.
Trajectory evaluation gives you the judgment layer.
If you do not capture structured run data, then trajectory evaluation becomes mostly storytelling after the fact.
And if you capture everything but never judge it, then you have observability without real evaluation.
What Good Agent Evaluation Looks Like in Practice
The public literature is converging on the same broad lesson even when the terminology differs.
Anthropic, AWS, Google Cloud, Databricks, LangSmith-style trajectory evals, and newer step-quality work like AgentProcessBench all point in the same direction:
- final-answer scoring still matters
- intermediate behavior matters too
- tool choice and parameter quality matter
- retries and recovery matter
- policy compliance matters
- consistency across runs matters
That means a serious evaluation stack usually combines multiple layers:
- Final outcome checks
- Trajectory or step-quality checks
- Tool-use correctness checks
- Recovery and robustness checks
- Policy and guardrail checks
In practice, those often map to eval forms like this:
| Eval form | What it checks |
|---|---|
| Final outcome checks | whether the final answer or completed task was correct |
| Trajectory checks | whether the sequence of steps matched a sensible path |
| Tool-call checks | whether the right tools, arguments, and interpretations were used |
| Policy assertions | whether the run stayed inside permissions, approvals, and guardrails |
Some teams will score these with custom datasets and judges.
Some will use trajectory evaluators in platforms like LangSmith.
Some will start with narrower internal rubrics around their most expensive or risky workflows.
Many mature systems end up with a stack that looks something like this:
- Deterministic assertions for objective constraints
- Milestone scoring for progress through the workflow
- Judge-based scoring for sequence quality and reasoning quality
- Outcome scoring for final task success
- Long-run reliability checks across repeated executions
That last layer matters because one apparently good run proves much less than teams think.
A trajectory evaluation program should help you answer not just:
Did this run work?
but also:
Does this system usually behave well under realistic variation?
The tooling can vary.
The principle should not.
You need to know whether the run was trustworthy, not just whether the ending looked good.
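Stitched together, the layers can feed one aggregate verdict per run.
The sketch below reuses the earlier sketches (`assert_trajectory`, `milestone_score`), treats hard assertions as gates and soft scores as signals, and its 0.7 threshold is an arbitrary assumption.

```python
def evaluate_run(trace: dict, events: list[dict],
                 milestones: list, judge_scores: list[float],
                 allowed_tools: set[str], retry_budget: int) -> dict:
    """One aggregate record per run: assertions gate, scores inform."""
    violations = assert_trajectory(events, allowed_tools, retry_budget)
    progress = milestone_score(trace, milestones)
    avg_judge = sum(judge_scores) / len(judge_scores) if judge_scores else 0.0
    return {
        "violations": violations,                 # deterministic checks
        "milestone_credit": progress["credit"],   # progress through the workflow
        "judge_score": avg_judge,                 # sequence and reasoning quality
        "trustworthy": (not violations
                        and progress["credit"] == 1.0
                        and avg_judge >= 0.7),    # threshold is an assumption
    }
```

Run this across many executions and the long-run reliability question becomes answerable: not whether one run passed, but how often runs pass under realistic variation.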
The Failure Modes of Shallow Agent Evals
Most weak agent-evaluation programs fail in predictable ways.
They do one or more of these:
- score only the final answer
- confuse logging with evaluation
- reward verbose reasoning instead of useful reasoning
- ignore tool misuse when the answer happens to be correct
- ignore recovery behavior completely
- ignore policy fit because guardrails blocked the worst outcome
Many public articles make the same mistake in softer form.
They say "evaluate the trajectory" but never explain what should actually be judged inside the run.
That is why this article anchors on a concrete inspection model.
Without one, trajectory evaluation stays abstract.
FAQ
Why are normal output evals not enough for agent systems?
Because agents act over multiple steps.
A final answer can be correct even when the run that produced it was wasteful, brittle, unsafe, or out of policy.
What should I evaluate inside an agent run?
Start with the S.T.E.P. dimensions:
Sequence, Tooling, Error Recovery, and Policy Fit.
That gives you a practical way to inspect the run beyond the ending.
Is trajectory evaluation only relevant for multi-agent systems?
No.
It matters anywhere an agent acts over time.
A single-agent ReAct loop can still produce a bad trajectory.
Multi-agent systems just make the need more obvious because there are more handoffs and more places to fail.
When is full trajectory evaluation worth the effort?
It is most worth it when the agent can act over time, call tools, create side effects, or appear successful after a fragile run.
If the system is mostly a single bounded response, output evaluation may still be enough.
If the system can wander, retry, recover, or cross execution boundaries, the path itself deserves evaluation.
Do I need a golden trajectory or perfect reference path?
No.
Sometimes a reference trajectory is useful, especially when the workflow is stable and certain steps are mandatory.
But many real agent tasks allow several good paths.
In those cases, milestone scoring, programmatic assertions, and judge-based evaluation are often better than pretending there is only one correct route.
Do I need an LLM-as-a-judge to do trajectory evaluation?
Not always.
Use deterministic assertions where the rule is objective.
Use human review when the stakes are high or the rubric is still being formed.
Use LLM-as-a-judge when the path is flexible and you need scalable scoring, but only with a clear rubric and periodic human calibration.
Is trajectory evaluation the same as tracing?
No.
Tracing captures evidence about the run.
Trajectory evaluation uses that evidence to judge whether the run was good.
Can a bad trajectory still produce a good answer?
Yes.
That is exactly why this topic matters.
A lucky ending can hide a run that would not survive slightly different conditions in production.
Final Thought
For agent systems, evaluation is no longer only about correctness.
It is about trustworthiness.
The final answer still matters.
But once the system can plan, act, retry, recover, and cross execution boundaries, the run itself becomes part of what you are shipping.
That means the path has to be good enough to trust, not just lucky enough to finish.