
Evaluating Agent Trajectories, Not Just Outputs

A correct final answer does not prove that an agent behaved well. Agent evaluation has to judge the run itself: the sequence, tool use, recovery behavior, and policy fit that produced the answer.

Most teams start agent evaluation in the wrong place.

They score the ending.

If the final answer looks correct, they call the run a success.

That works better for single-turn model calls than it does for agent systems.

Agents do not only answer.

They plan, retrieve, call tools, retry, recover, escalate, and cross execution boundaries.

That means a correct final answer can hide a bad run.

The real question is no longer:

Did the agent get the answer right?

It becomes:

Did the agent behave well enough to trust in production?

That is the job of trajectory evaluation.

This article builds on Planning and Task Decomposition, ReAct and the Basic Reasoning Loop, Tool Use: How Agents Take Action, Structured Outputs, Guardrails, and Execution Boundaries, Supervisor, Router, and Planner-Executor Patterns, and Human-in-the-Loop Control Design. Those pieces explain how agent systems think, act, coordinate, and stay bounded. This one explains how to judge whether the run itself was good.

The Trap of Final-Answer Scoring

The simplest failure mode is also the most dangerous:

the system looks correct at the end, so the team never notices that the path was weak.

Imagine an agent that prepares a customer refund.

It eventually returns the correct amount.

But along the way it queried the wrong account first, retried a failing lookup three times without changing anything, and proposed an amount above its approval limit that a guardrail had to block.

The ending looks fine.

The trajectory does not.

That run may still be wasteful, brittle, unsafe, or out of policy.

This is why output-only evaluation breaks for agents.

A correct result does not prove a trustworthy process.

Output Quality vs Process Quality

This distinction has to be explicit.

Output quality asks whether the final answer, action, or completed task was correct.

Process quality asks whether the agent took a sensible, safe, and efficient path to get there.

Those are related.

They are not the same.

Output quality tells you whether the destination was acceptable.

Process quality tells you whether the system should be trusted to travel that route again.

For agent systems, both matter.

If you grade only the destination, you miss wasted steps, fragile recoveries that happened to work, and unsafe actions that were only stopped by a downstream control.

That last case matters more than many teams realize.

A blocked unsafe action is not proof of a good run.

It may be proof that a downstream control rescued a bad run before it became an incident.

What a Trajectory Actually Is

A trajectory is the path the agent takes through the task.

In practice, that means the sequence of decisions, tool calls, observations, retries, and state changes the agent produces between the first step and the final answer.

If ReAct and the Basic Reasoning Loop explains the local thought-action-observation cycle, trajectory evaluation looks at the full run those local cycles create.

If Planning and Task Decomposition explains how the work is broken down, trajectory evaluation asks whether that decomposition actually produced a trustworthy path.

If Supervisor, Router, and Planner-Executor Patterns explains who controls the system, trajectory evaluation asks whether that orchestration produced good behavior.

That is why this topic sits naturally after control design.

Once an agent can act over time, the run becomes part of the product.

What Deserves Evaluation Inside the Run

Not every agent needs the same depth of review.

If the system is basically a single bounded response with no real action surface, output quality may still do most of the work.

Trajectory evaluation matters most when the system acts over multiple steps, calls tools, creates side effects, or can appear successful after a fragile run.

For those systems, four dimensions matter again and again:

| Dimension | What you inspect | What failure looks like |
| --- | --- | --- |
| Sequence quality | Did the run follow a sensible path? | loops, wasted steps, wrong order, avoidable detours |
| Tooling quality | Did it choose and use tools correctly? | wrong tool, bad arguments, misread tool output |
| Recovery quality | Did it handle failure well? | repeated retries, no adaptation, brittle collapse |
| Policy fit | Did it stay inside the allowed operating boundary? | unsafe proposals, permission drift, skipped checkpoints |

There is also one cross-cutting issue that deserves explicit attention:

state consistency.

Did the agent stay aligned with the actual state of the world as the run progressed?

This is where a lot of trajectories quietly degrade.

An agent may act on stale data, assume a step succeeded when it quietly failed, or lose track of changes it has already made.

State consistency does not need to become a fifth headline dimension every time, but it should be checked across the four that already matter.

Weak state consistency usually shows up as contradictory steps, repeated work, or actions taken against a world that has already moved on.

This is the key shift.

You are no longer only grading what the agent said at the end.

You are grading how it behaved while it was acting.

The S.T.E.P. Trajectory Review

A simple way to inspect an agent run is the S.T.E.P. Trajectory Review.

S.T.E.P. stands for Sequence, Tooling, Error Recovery, and Policy Fit.

This is not a vendor framework.

It is a compact way to inspect the run itself.

Applied to the refund example above, the review would ask four questions: was the path to the correct amount sensible, were the account and refund tools chosen and called correctly, what happened after any failed step, and did the proposed action stay inside limits and approvals?

That is the practical point of the framework.

It gives you a way to judge the same run from four different angles instead of collapsing everything into one outcome score.

Sequence

Did the agent take a sensible route through the task?

Good sequence quality usually means steady forward progress, steps taken in a sensible order, and no avoidable detours.

Bad sequence quality looks like loops, wasted steps, work done in the wrong order, and detours that add cost without adding information.

This is often where a run becomes expensive long before it becomes obviously wrong.

Tooling

Did the agent pick the right tool and use it correctly?

Good tooling quality usually means the right tool is selected, the arguments are valid, and the tool's output is read correctly.

Bad tooling quality looks like the wrong tool for the job, malformed arguments, and misread tool output carried forward as fact.

This is where Tool Use: How Agents Take Action becomes evaluative instead of architectural.

Error Recovery

What happened after the run went off track?

Good recovery behavior usually means the failure is noticed quickly, the next attempt changes something meaningful, and the agent escalates when it is genuinely stuck.

Bad recovery behavior looks like repeated identical retries, no adaptation, and a brittle collapse once the easy path is closed.

Two runs may both succeed.

One may succeed cleanly.

The other may succeed after three fragile recoveries that will fail under slightly different conditions.

Those are not equally good runs.

Policy Fit

Did the run stay inside the allowed operating boundary?

This matters even when the final answer is right.

A run can be factually correct and still unacceptable if it proposes unsafe actions, drifts outside its permissions, or skips required approval checkpoints.

This is where trajectory evaluation connects directly back to Structured Outputs, Guardrails, and Execution Boundaries and Human-in-the-Loop Control Design.

The run must not only work.

It must work acceptably.

A Good Answer Can Still Come From a Bad Trajectory

This is the most important distinction in the article.

The final answer and the trajectory are related, but they are not interchangeable.

The simplest version is that a good answer can come from a good run, a bad answer can come from a bad run, and, less comfortably, a good answer can come from a bad run.

That last case is the one teams underestimate.

They think:

The system worked.

What actually happened may be:

The system survived this run.

Those are not the same thing.

Trajectory Evaluation Is Not the Same as Tracing

This distinction should stay clean because the next topic depends on it.

Trajectory evaluation asks:

What should be judged about the run?

Tracing and observability ask:

What evidence did we capture about the run, and can we inspect it later?

Tracing gives you the visibility layer: the ordered steps, tool calls, arguments, results, and intermediate state of the run.

Trajectory evaluation applies judgment to that evidence: whether the path was sensible, the tools were used well, recovery was sound, and the run stayed in policy.

So tracing is not the judgment.

It is the evidence layer that makes judgment possible.

But that still leaves an important practical question:

how do you actually score a trajectory once you have the trace?

There are three common approaches.

Reference Trajectories, Judge-Scored Trajectories, and Assertions

Once a team accepts that the run itself must be judged, the next question is:

what does the run get judged against?

In practice, most trajectory evaluation falls into three buckets.

1. Reference trajectories

This is the closest thing to answer-key evaluation for agent runs.

You define an expected path, or an expected set of acceptable steps, and compare the actual run against it.

That can be useful when the workflow is stable, certain steps are mandatory, and the order of operations genuinely matters.

Examples: a refund flow that must verify the account before proposing an amount, or a compliance workflow where a policy check must precede any action.

The strength of reference trajectories is clarity.

They make it obvious when the agent skipped a required step, did things in the wrong order, or touched the wrong part of the workflow.

The weakness is rigidity.

Many good agent runs are not identical.

If the environment is dynamic, there may be several acceptable routes to the same good outcome.

So reference matching works best when the path is constrained enough that you genuinely care about adherence, not just eventual success.
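A minimal form of reference matching checks that the required steps appear in the right order, while tolerating extra steps in between. The step names below are illustrative assumptions, not part of any fixed vocabulary.

```python
# Sketch of reference-trajectory matching: do the required steps appear
# in order within the actual run? Extra steps are tolerated.
def matches_reference(actual_steps, required_in_order):
    it = iter(actual_steps)
    # Each membership test consumes the iterator, enforcing order.
    return all(step in it for step in required_in_order)

required = ["retrieve_account", "verify_policy", "issue_refund"]
run = ["retrieve_account", "search_kb", "verify_policy", "issue_refund"]
print(matches_reference(run, required))  # True: required steps appear in order
```

Stricter variants can forbid extra steps entirely or require exact equality; the right strictness depends on how constrained the workflow genuinely is.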

2. Judge-scored trajectories

Sometimes there is no single perfect route.

The run still needs to be judged, but the judgment has to be more flexible.

That is where judge-based evaluation comes in.

A human reviewer or an LLM-as-a-judge looks at the run and scores things like sequence quality, tool choice, recovery behavior, and policy adherence.

This is often the right approach when the environment is dynamic, several routes to a good outcome are acceptable, and no single reference path exists.

The strength of judge-scored trajectories is flexibility.

The weakness is that the rubric matters enormously.

If the scoring prompt is vague, the evaluation becomes vague.

If the judge is not calibrated to human standards, the scores become theater.

So judge-based evaluation is useful, but only when the team has a clear rubric for what should count as a good sequence, acceptable tool use, sound recovery, and policy-compliant behavior.
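A rubric like that can be made concrete before any judge sees it. The sketch below is illustrative: the dimension names follow the S.T.E.P. review in this article, but the 1-to-5 scale and the simple averaging are assumptions, not a standard.

```python
# Illustrative judge rubric. Dimension names mirror the S.T.E.P. review;
# the 1-5 scale and averaging are arbitrary choices for the sketch.
RUBRIC = {
    "sequence": "Did the run take a sensible path without loops or detours?",
    "tooling": "Were the right tools chosen and called with valid arguments?",
    "recovery": "After failures, did the agent adapt rather than blindly retry?",
    "policy": "Did the run stay inside permissions and approval checkpoints?",
}

def aggregate(scores: dict) -> float:
    """Average the per-dimension scores a judge returned (1-5 scale)."""
    return sum(scores.values()) / len(scores)

# Scores a human or LLM judge might return for one run:
judge_output = {"sequence": 4, "tooling": 5, "recovery": 2, "policy": 5}
print(aggregate(judge_output))  # 4.0
```

Keeping the rubric as explicit data also makes periodic human calibration easier: the same questions go to both the human and the model judge.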

3. Programmatic assertions

Some parts of trajectory quality do not need subjective judgment at all.

They can be checked deterministically.

Examples: no forbidden tool was called, required approval steps were not skipped, retry counts stayed under a budget, and tool arguments matched their schemas.

This layer matters because it keeps evaluation from becoming entirely opinion-driven.
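Checks like these are straightforward to write once the run is recorded as structured events. This is a minimal sketch; the event dicts and field names are assumptions about one possible record shape, not a standard schema.

```python
# Deterministic assertions over a recorded trajectory. The event schema
# ({"type": ..., "tool": ...}) is a hypothetical example, not a standard.
def check_trajectory(events, allowed_tools, max_retries=3, approval_required=False):
    violations = []
    retries = 0
    approved = False
    for e in events:
        if e["type"] == "tool_call" and e["tool"] not in allowed_tools:
            violations.append(f"forbidden tool: {e['tool']}")
        elif e["type"] == "retry":
            retries += 1
        elif e["type"] == "approval":
            approved = True
    if retries > max_retries:
        violations.append(f"retry budget exceeded: {retries}")
    if approval_required and not approved:
        violations.append("skipped approval checkpoint")
    return violations

run = [
    {"type": "tool_call", "tool": "lookup_account"},
    {"type": "retry"},
    {"type": "tool_call", "tool": "issue_refund"},
]
print(check_trajectory(run, {"lookup_account"}, approval_required=True))
# flags the forbidden tool call and the missing approval checkpoint
```

An empty violation list is a necessary condition for a good run, never a sufficient one; the subjective layers still apply on top.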

Strong agent evaluation usually combines all three: reference matching where the path is constrained, judge-based scoring where it is flexible, and programmatic assertions for the objective constraints underneath both.

Milestones and Partial Credit Matter More Than Teams Expect

One weakness in naive trajectory evaluation is that it treats the run as either a complete success or a complete failure.

That is often too crude.

Many real tasks are multi-stage.

An agent may gather the right information, make sound intermediate decisions, and still fail at the final step.

If you score only pass or fail, you lose the ability to see whether the system is improving.

That is why milestone-based evaluation matters.

Instead of judging only the ending, you define meaningful checkpoints inside the run.

For example, in a research workflow the milestones might be:

  1. Identify the right sources
  2. Extract the relevant facts
  3. Resolve conflicts or ambiguities
  4. Produce the final synthesis

In a customer-support workflow they might be:

  1. Retrieve the right account
  2. Verify policy constraints
  3. Prepare the right action
  4. Route to approval if needed
  5. Complete or draft the resolution

This does two useful things.

First, it tells you whether the agent is failing early or late.

Second, it gives partial credit for meaningful progress instead of flattening every failure into the same undifferentiated bucket.

That matters operationally.

An agent that consistently reaches the right milestones and fails at the final step is a different engineering problem from an agent that goes off track immediately.
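Milestone scoring can be sketched in a few lines. The checkpoint names below come from the customer-support workflow above; treating progress as stopping at the first missed checkpoint is one possible convention, not the only one.

```python
# Hypothetical sketch of milestone-based partial credit. Progress is
# counted up to the first missed checkpoint in the ordered list.
from typing import List, Set

def milestone_score(milestones: List[str], reached: Set[str]) -> float:
    """Fraction of ordered milestones reached before the first miss."""
    credit = 0
    for m in milestones:
        if m not in reached:
            break  # progress stops at the first missed checkpoint
        credit += 1
    return credit / len(milestones)

support_flow = [
    "retrieve_account",
    "verify_policy",
    "prepare_action",
    "route_approval",
    "complete_resolution",
]

# A run that found the account and verified policy, then failed:
print(milestone_score(support_flow, {"retrieve_account", "verify_policy"}))  # 0.4
```

A score of 0.4 on every run tells a very different story from a mix of 0.0 and 1.0, even though both could average out the same.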

Milestones also work well with the S.T.E.P. lens.

At each checkpoint, you can ask whether the sequence so far is sensible, whether the tools were used correctly, whether any recovery was sound, and whether the run is still inside policy.

So milestone scoring is not a replacement for trajectory evaluation.

It is one of the clearest ways to make trajectory evaluation more informative.

What Data You Actually Need to Evaluate a Trajectory

One reason teams stay vague on trajectory evaluation is that they never define the minimum evidence needed to do it.

You do not need a perfect observability platform before you can start.

But you do need more than the final answer.

At minimum, a usable trajectory record should preserve the ordered steps of the run, each tool call with its arguments and results, any errors and retries, intermediate decisions or state changes, and the final answer.

Without that structure, you cannot really tell whether the path was sensible, how the tools were actually used, or where the run went off track.

This is where instrumentation and evaluation meet.

Tracing gives you the raw material.

Trajectory evaluation gives you the judgment layer.

If you do not capture structured run data, then trajectory evaluation becomes mostly storytelling after the fact.

And if you capture everything but never judge it, then you have observability without real evaluation.
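One way to keep a record usable is to give it an explicit shape. The dataclasses below are a sketch of a minimum viable record; every field name here is an assumption about what is worth preserving, not a standard format.

```python
# A minimal trajectory record, sketched as dataclasses.
# All field names are illustrative assumptions, not a standard schema.
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Step:
    kind: str                       # e.g. "plan", "tool_call", "observation", "retry"
    tool: Optional[str] = None      # tool name when kind == "tool_call"
    arguments: Optional[dict] = None
    result: Optional[Any] = None
    error: Optional[str] = None     # preserved so recovery can be judged later

@dataclass
class TrajectoryRecord:
    task_id: str
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None

record = TrajectoryRecord(task_id="refund-001")
record.steps.append(Step(kind="tool_call", tool="lookup_account",
                         arguments={"customer_id": "c-42"}))
record.final_answer = "Refund of $40 prepared for approval."
print(len(record.steps))  # 1
```

Even this small structure is enough to drive ordered-step checks, tool-call assertions, and milestone scoring; richer platforms add timestamps, token counts, and nesting on top.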

What Good Agent Evaluation Looks Like in Practice

The public literature is converging on the same broad lesson even when the terminology differs.

Anthropic, AWS, Google Cloud, Databricks, LangSmith-style trajectory evals, and newer step-quality work like AgentProcessBench all point in the same direction: final-answer accuracy alone is not enough, and the process that produced the answer must be judged too.

That means a serious evaluation stack usually combines multiple layers:

  1. Final outcome checks
  2. Trajectory or step-quality checks
  3. Tool-use correctness checks
  4. Recovery and robustness checks
  5. Policy and guardrail checks

In practice, those often map to eval forms like this:

| Eval form | What it checks |
| --- | --- |
| Final outcome checks | whether the final answer or completed task was correct |
| Trajectory checks | whether the sequence of steps matched a sensible path |
| Tool-call checks | whether the right tools, arguments, and interpretations were used |
| Policy assertions | whether the run stayed inside permissions, approvals, and guardrails |

Some teams will score these with custom datasets and judges.

Some will use trajectory evaluators in platforms like LangSmith.

Some will start with narrower internal rubrics around their most expensive or risky workflows.

Many mature systems end up with a stack that looks something like this:

  1. Deterministic assertions for objective constraints
  2. Milestone scoring for progress through the workflow
  3. Judge-based scoring for sequence quality and reasoning quality
  4. Outcome scoring for final task success
  5. Long-run reliability checks across repeated executions

That last layer matters because one apparently good run proves much less than teams think.

A trajectory evaluation program should help you answer not just:

Did this run work?

but also:

Does this system usually behave well under realistic variation?
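That second question can be made measurable by rerunning the same task under varied inputs or seeds and scoring the pass rate. The sketch below assumes each repeated run has already been reduced to a pass/fail verdict by the layers above.

```python
# Sketch: reliability across repeated executions of the same task.
# Each boolean is one run's overall verdict; the sample data is illustrative.
def pass_rate(run_results):
    """Fraction of repeated runs that passed all checks."""
    return sum(run_results) / len(run_results)

repeated = [True, True, False, True, True, True, True, False, True, True]
print(pass_rate(repeated))  # 0.8
```

A single clean trace and an 80% pass rate under variation are very different findings, and only the second one supports a production decision.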

The tooling can vary.

The principle should not.

You need to know whether the run was trustworthy, not just whether the ending looked good.
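As a toy illustration of pulling those layers into one verdict: the field names and the thresholds below are arbitrary assumptions for the sketch, not recommended values.

```python
# Illustrative combined report over one run. Inputs are the outputs of the
# earlier layers; field names and thresholds are arbitrary assumptions.
def eval_report(assertion_violations, milestone_frac, judge_avg, outcome_ok):
    return {
        "assertions_passed": not assertion_violations,
        "milestone_progress": milestone_frac,
        "judge_score": judge_avg,
        "outcome_correct": outcome_ok,
        # A run is only "trustworthy" if every layer clears its bar.
        "trustworthy": (not assertion_violations) and outcome_ok
                       and milestone_frac == 1.0 and judge_avg >= 3.5,
    }

print(eval_report([], 1.0, 4.0, True)["trustworthy"])  # True
```

The point of the combined verdict is exactly the article's thesis: a correct outcome with a failed assertion or a weak judge score is not a trustworthy run.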

The Failure Modes of Shallow Agent Evals

Most weak agent-evaluation programs fail in predictable ways.

They do one or more of these:

Many public articles make the same mistake in softer form.

They say evaluate the trajectory but never explain what should actually be judged inside the run.

That is why trajectory evaluation needs a concrete inspection model.

Without one, trajectory evaluation stays abstract.

FAQ

Why are normal output evals not enough for agent systems?

Because agents act over multiple steps.

A final answer can be correct even when the run that produced it was wasteful, brittle, unsafe, or out of policy.

What should I evaluate inside an agent run?

Start with the S.T.E.P. dimensions: sequence quality, tooling quality, error recovery, and policy fit.

That gives you a practical way to inspect the run beyond the ending.

Is trajectory evaluation only relevant for multi-agent systems?

No.

It matters anywhere an agent acts over time.

A single-agent ReAct loop can still produce a bad trajectory.

Multi-agent systems just make the need more obvious because there are more handoffs and more places to fail.

When is full trajectory evaluation worth the effort?

It is most worth it when the agent can act over time, call tools, create side effects, or appear successful after a fragile run.

If the system is mostly a single bounded response, output evaluation may still be enough.

If the system can wander, retry, recover, or cross execution boundaries, the path itself deserves evaluation.

Do I need a golden trajectory or perfect reference path?

No.

Sometimes a reference trajectory is useful, especially when the workflow is stable and certain steps are mandatory.

But many real agent tasks allow several good paths.

In those cases, milestone scoring, programmatic assertions, and judge-based evaluation are often better than pretending there is only one correct route.

Do I need an LLM-as-a-judge to do trajectory evaluation?

Not always.

Use deterministic assertions where the rule is objective.

Use human review when the stakes are high or the rubric is still being formed.

Use LLM-as-a-judge when the path is flexible and you need scalable scoring, but only with a clear rubric and periodic human calibration.

Is trajectory evaluation the same as tracing?

No.

Tracing captures evidence about the run.

Trajectory evaluation uses that evidence to judge whether the run was good.

Can a bad trajectory still produce a good answer?

Yes.

That is exactly why this topic matters.

A lucky ending can hide a run that would not survive slightly different conditions in production.

Final Thought

For agent systems, evaluation is no longer only about correctness.

It is about trustworthiness.

The final answer still matters.

But once the system can plan, act, retry, recover, and cross execution boundaries, the run itself becomes part of what you are shipping.

That means the path has to be good enough to trust, not just lucky enough to finish.