Most teams start agent evaluation in the wrong place.
They score the ending.
If the final answer looks correct, they call the run a success.
That works better for single-turn model calls than it does for agent systems.
Agents do not only answer.
They plan, retrieve, call tools, retry, recover, escalate, and cross execution boundaries.
That means a correct final answer can hide a bad run.
The real question is no longer:
Did the agent get the answer right?
It becomes:
Did the agent behave well enough to trust in production?
That is the job of trajectory evaluation.
This article builds on Planning and Task Decomposition, ReAct and the Basic Reasoning Loop, Tool Use: How Agents Take Action, Structured Outputs, Guardrails, and Execution Boundaries, Supervisor, Router, and Planner-Executor Patterns, and Human-in-the-Loop Control Design. Those pieces explain how agent systems think, act, coordinate, and stay bounded. This one explains how to judge whether the run itself was good.
The Trap of Final-Answer Scoring
The simplest failure mode is also the most dangerous:
the system looks correct at the end, so the team never notices that the path was weak.
Imagine an agent that prepares a customer refund.
It eventually returns the correct amount.
But along the way it:
- checked the wrong account first
- made failed tool calls before finding the right one
- used stale context before correcting itself
- proposed an out-of-policy action that a guardrail blocked
- succeeded only after a late retry recovered the run
The ending looks fine.
The trajectory does not.
That run may still be:
- too expensive
- too slow
- too brittle
- too risky
- too dependent on lucky recovery
This is why output-only evaluation breaks for agents.
A correct result does not prove a trustworthy process.
Output Quality vs Process Quality
This distinction has to be explicit.
Output quality asks whether the final answer, action, or completed task was correct.
Process quality asks whether the agent took a sensible, safe, and efficient path to get there.
Those are related.
They are not the same.
Output quality tells you whether the destination was acceptable.
Process quality tells you whether the system should be trusted to travel that route again.
For agent systems, both matter.
If you grade only the destination, you miss:
- silent failures
- hidden brittleness
- repeated retries
- tool misuse
- recoveries that only worked by luck
- policy violations that were blocked downstream instead of avoided upstream
That last case matters more than many teams realize.
A blocked unsafe action is not proof of a good run.
It may be proof that a downstream control rescued a bad run before it became an incident.
What a Trajectory Actually Is
A trajectory is the path the agent takes through the task.
In practice, that means the sequence of:
- reasoning steps
- tool calls
- observations
- retries
- corrections
- handoffs
- pauses
- completion decisions
If ReAct and the Basic Reasoning Loop explains the local thought-action-observation cycle, trajectory evaluation looks at the full run those local cycles create.
If Planning and Task Decomposition explains how the work is broken down, trajectory evaluation asks whether that decomposition actually produced a trustworthy path.
If Supervisor, Router, and Planner-Executor Patterns explains who controls the system, trajectory evaluation asks whether that orchestration produced good behavior.
That is why this topic sits naturally after control design.
Once an agent can act over time, the run becomes part of the product.
What Deserves Evaluation Inside the Run
Not every agent needs the same depth of review.
If the system is basically a single bounded response with no real action surface, output quality may still do most of the work.
Trajectory evaluation matters most when the system:
- acts over multiple steps
- calls tools or external services
- can create side effects
- may recover from intermediate mistakes before producing the final answer
For those systems, four dimensions matter again and again:
| Dimension | What you inspect | What failure looks like |
|---|---|---|
| Sequence quality | Did the run follow a sensible path? | loops, wasted steps, wrong order, avoidable detours |
| Tooling quality | Did it choose and use tools correctly? | wrong tool, bad arguments, misread tool output |
| Recovery quality | Did it handle failure well? | repeated retries, no adaptation, brittle collapse |
| Policy fit | Did it stay inside the allowed operating boundary? | unsafe proposals, permission drift, skipped checkpoints |
There is also one cross-cutting issue that deserves explicit attention:
state consistency.
Did the agent stay aligned with the actual state of the world as the run progressed?
This is where a lot of trajectories quietly degrade.
An agent may:
- keep reasoning from stale context after a tool result changed the situation
- forget that an earlier step already failed
- continue planning from an outdated assumption
- produce a locally coherent answer that is globally inconsistent with the current task state
State consistency does not need to become a fifth headline dimension every time, but it should be checked across the four that already matter.
Weak state consistency usually shows up as:
- poor sequence decisions
- misuse of tool outputs
- bad retries
- false confidence after the environment has already changed
This is the key shift.
You are no longer only grading what the agent said at the end.
You are grading how it behaved while it was acting.
The S.T.E.P. Trajectory Review
A simple way to inspect an agent run is the S.T.E.P. Trajectory Review.
S.T.E.P. stands for:
- Sequence
- Tooling
- Error Recovery
- Policy Fit
This is not a vendor framework.
It is a compact way to inspect the run itself.
Applied to the refund example above, the review would look like this:
- Sequence: did the agent check the customer and refund context in a sensible order, or did it wander through avoidable steps first?
- Tooling: did it call the right account and refund tools with the right arguments, or did it misuse them and recover later by luck?
- Error Recovery: once the first account lookup failed, did it adapt intelligently, or just keep retrying until something worked?
- Policy Fit: even if it found the right amount, did it propose any action that should have required a checkpoint or crossed the allowed operating boundary?
That is the practical point of the framework.
It gives you a way to judge the same run from four different angles instead of collapsing everything into one outcome score.
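To make that concrete, a S.T.E.P. review can be recorded as a small per-run record.
This is a minimal sketch in Python, not a standard schema; the field names and the 0-to-1 scoring scale are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class StepReview:
    """One S.T.E.P. review of a single agent run (0.0-1.0 per dimension, assumed scale)."""
    run_id: str
    sequence: float        # did the run follow a sensible path?
    tooling: float         # were tools chosen and used correctly?
    error_recovery: float  # did the run adapt after failures?
    policy_fit: float      # did the run stay inside the allowed boundary?
    notes: list[str] = field(default_factory=list)

    def passed(self, threshold: float = 0.7) -> bool:
        # Gate on the weakest dimension, so one strong score
        # cannot mask a weak one.
        return min(self.sequence, self.tooling,
                   self.error_recovery, self.policy_fit) >= threshold

review = StepReview(
    run_id="refund-0042",
    sequence=0.9, tooling=0.6, error_recovery=0.8, policy_fit=1.0,
    notes=["first account lookup hit the wrong account"],
)
print(review.passed())  # False: the tooling score drags the run down
```

Gating on the weakest dimension is deliberate: a run that misuses tools or drifts out of policy should not pass on the strength of its other scores.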
Sequence
Did the agent take a sensible route through the task?
Good sequence quality usually means:
- the task was decomposed reasonably
- the order of operations made sense
- the run converged instead of wandering
- extra steps were justified by new evidence rather than confusion
Bad sequence quality looks like:
- circular loops
- unnecessary substeps
- checking things too late
- acting before key information was confirmed
This is often where a run becomes expensive long before it becomes obviously wrong.
Tooling
Did the agent pick the right tool and use it correctly?
Good tooling quality usually means:
- the tool choice matched the need
- the arguments were correct
- the result was interpreted correctly
- the agent did not keep reaching for tools it did not need
Bad tooling quality looks like:
- wrong tool selection
- invented parameters
- repeated failed calls
- correct tool output followed by incorrect reasoning about what it means
This is where Tool Use: How Agents Take Action becomes evaluative instead of architectural.
Error Recovery
What happened after the run went off track?
Good recovery behavior usually means:
- the agent noticed the failure
- changed strategy instead of repeating itself
- rechecked the state of the world
- stopped when escalation or review was justified
Bad recovery behavior looks like:
- blind retries
- repeating the same failed action
- compounding an early mistake
- reaching the correct answer only by luck after several bad steps
Two runs may both succeed.
One may succeed cleanly.
The other may succeed after three fragile recoveries that will fail under slightly different conditions.
Those are not equally good runs.
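Part of that difference is mechanically checkable.
The sketch below flags blind retries: the same failed tool call repeated with identical arguments.
The event shape is an assumption for illustration, not a standard trace format.

```python
def count_blind_retries(events: list[dict]) -> int:
    """Count times a failed tool call was immediately repeated with the
    same tool and identical arguments -- retrying without adapting.
    Assumed event shape: {"tool": str, "args": dict, "ok": bool}."""
    blind = 0
    last_failure = None  # (tool, frozen args) of the most recent failed call
    for event in events:
        call = (event["tool"], tuple(sorted(event["args"].items())))
        if not event["ok"]:
            if call == last_failure:
                blind += 1  # same failing call, same arguments, tried again
            last_failure = call
        else:
            last_failure = None  # a success ends the retry chain
    return blind

events = [
    {"tool": "lookup_account", "args": {"id": "A1"}, "ok": False},
    {"tool": "lookup_account", "args": {"id": "A1"}, "ok": False},  # blind retry
    {"tool": "lookup_account", "args": {"id": "A2"}, "ok": True},   # adapted
]
print(count_blind_retries(events))  # 1
```

A count like this cannot tell you whether the recovery was intelligent, but it cheaply separates runs that adapted from runs that hammered the same failure.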
Policy Fit
Did the run stay inside the allowed operating boundary?
This matters even when the final answer is right.
A run can be factually correct and still unacceptable if it:
- proposed a forbidden action
- crossed a permission boundary
- used an untrusted source
- skipped a required approval
- ignored a required review or escalation point
This is where trajectory evaluation connects directly back to Structured Outputs, Guardrails, and Execution Boundaries and Human-in-the-Loop Control Design.
The run must not only work.
It must work acceptably.
A Good Answer Can Still Come From a Bad Trajectory
This is the most important distinction in the article.
The final answer and the trajectory are related, but they are not interchangeable.
The simplest version is:
- a good answer from a good trajectory is the goal
- a bad answer from a bad trajectory is easy to reject
- a bad answer from a good trajectory is often fixable
- a good answer from a bad trajectory is dangerous because it creates false confidence
That last case is the one teams underestimate.
They think:
The system worked.
What actually happened may be:
The system survived this run.
Those are not the same thing.
Trajectory Evaluation Is Not the Same as Tracing
This distinction should stay clean because the next topic depends on it.
Trajectory evaluation asks:
What should be judged about the run?
Tracing and observability ask:
What evidence did we capture about the run, and can we inspect it later?
Tracing gives you the visibility layer:
- tool-call history
- timestamps
- state transitions
- intermediate outputs
- retry chains
- handoff visibility
Trajectory evaluation applies judgment to that evidence:
- was the sequence sound?
- was the tool use precise?
- was the recovery behavior acceptable?
- did the run stay inside policy?
So tracing is not the judgment.
It is the evidence layer that makes judgment possible.
But that still leaves an important practical question:
how do you actually score a trajectory once you have the trace?
There are three common approaches.
Reference Trajectories, Judge-Scored Trajectories, and Assertions
Once a team accepts that the run itself must be judged, the next question is:
what does the run get judged against?
In practice, most trajectory evaluation falls into three buckets.
1. Reference trajectories
This is the closest thing to answer-key evaluation for agent runs.
You define an expected path, or an expected set of acceptable steps, and compare the actual run against it.
That can be useful when:
- the workflow is relatively stable
- the correct order of operations is known
- certain steps are mandatory
- the domain is compliance-heavy or operationally strict
Examples:
- the agent must verify identity before changing account settings
- the agent must retrieve the latest record before drafting a response
- the agent must call the approval service before executing a sensitive action
The strength of reference trajectories is clarity.
They make it obvious when the agent skipped a required step, did things in the wrong order, or touched the wrong part of the workflow.
The weakness is rigidity.
Many good agent runs are not identical.
If the environment is dynamic, there may be several acceptable routes to the same good outcome.
So reference matching works best when the path is constrained enough that you genuinely care about adherence, not just eventual success.
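When the path is that constrained, the adherence check itself can be simple.
The sketch below tests that the required steps appear in order while tolerating extra benign steps in between; the step names are illustrative.

```python
def follows_reference(actual_steps: list[str], required_in_order: list[str]) -> bool:
    """True if every required step appears in the run, in the required order.
    Extra steps in between are tolerated; skipped or reordered
    required steps are not."""
    remaining = iter(actual_steps)
    return all(step in remaining for step in required_in_order)

# A refund workflow where identity must be verified before money moves.
required = ["verify_identity", "fetch_refund_policy", "issue_refund"]

good_run = ["verify_identity", "lookup_account", "fetch_refund_policy", "issue_refund"]
bad_run  = ["lookup_account", "issue_refund", "verify_identity"]

print(follows_reference(good_run, required))  # True: order held, extras allowed
print(follows_reference(bad_run, required))   # False: refund issued before identity check
```

Because the `in` test consumes the iterator, each required step must appear after the previous one was found, which is exactly the ordered-subsequence property reference matching needs.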
2. Judge-scored trajectories
Sometimes there is no single perfect route.
The run still needs to be judged, but the judgment has to be more flexible.
That is where judge-based evaluation comes in.
A human reviewer or an LLM-as-a-judge looks at the run and scores things like:
- whether the sequence made sense
- whether the tool choice was appropriate
- whether the recovery behavior was intelligent
- whether the run stayed grounded and in policy
This is often the right approach when:
- multiple paths could be valid
- the work is partly subjective
- the task mixes reasoning, tool use, and judgment
- the team is still learning what good and bad runs look like
The strength of judge-scored trajectories is flexibility.
The weakness is that the rubric matters enormously.
If the scoring prompt is vague, the evaluation becomes vague.
If the judge is not calibrated to human standards, the scores become theater.
So judge-based evaluation is useful, but only when the team has a clear rubric for what should count as:
- effective progress
- neutral exploration
- harmful or invalid behavior
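As an illustration, those three bands can be written directly into the judge prompt.
The sketch below assumes a generic `call_llm` helper that takes a prompt string and returns the model's text; that helper is a placeholder, not a specific vendor API.

```python
import json

JUDGE_RUBRIC = """You are reviewing one step of an agent run.

Given the task, the state before the step, and the step itself, classify it:
- "effective": the step made real progress toward the task
- "neutral": reasonable exploration, neither progress nor harm
- "harmful": wasted work, tool misuse, or an out-of-policy action

Reply with JSON only: {"label": "...", "reason": "..."}"""

def judge_step(call_llm, task: str, state: str, step: str) -> dict:
    """Score one trajectory step against the rubric.
    `call_llm` is assumed to be prompt-in, text-out."""
    prompt = f"{JUDGE_RUBRIC}\n\nTask: {task}\nState before step: {state}\nStep: {step}"
    return json.loads(call_llm(prompt))
```

The rubric is the whole point here: the tighter the bands, the less the judge drifts, and periodic human spot checks keep the scores calibrated rather than theatrical.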
3. Programmatic assertions
Some parts of trajectory quality do not need subjective judgment at all.
They can be checked deterministically.
Examples:
- the agent used an allowed tool
- the tool arguments matched the schema
- the approval step happened before execution
- the retry count stayed within budget
- the final output included a required field
This layer matters because it keeps evaluation from becoming entirely opinion-driven.
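Checks like the five above can be plain functions over the trace, as in the sketch below.
The event shape and field names are assumptions for illustration.

```python
def assert_trajectory(events: list[dict],
                      allowed_tools: set[str],
                      retry_budget: int) -> list[str]:
    """Return the list of violated assertions; empty means all passed.
    Assumed event shape:
    {"type": "tool_call" | "approval" | "execute", "tool": str, "ok": bool}."""
    violations = []
    tool_calls = [e for e in events if e["type"] == "tool_call"]

    # Only allowed tools were used.
    for call in tool_calls:
        if call["tool"] not in allowed_tools:
            violations.append(f"disallowed tool: {call['tool']}")

    # Failed-call count stayed within the retry budget.
    failures = sum(1 for call in tool_calls if not call["ok"])
    if failures > retry_budget:
        violations.append(f"{failures} failed calls exceed the budget of {retry_budget}")

    # Approval, if execution happened, preceded it.
    types = [e["type"] for e in events]
    if "execute" in types:
        if "approval" not in types or types.index("approval") > types.index("execute"):
            violations.append("execution happened without prior approval")

    return violations
```

Assertions like these are cheap to run on every trace, which is why they belong underneath the more expensive judge layer.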
Strong agent evaluation usually combines all three:
- reference checks where the path is known
- judge-based scoring where the path is flexible
- assertions where the system should be objectively testable
Milestones and Partial Credit Matter More Than Teams Expect
One weakness in naive trajectory evaluation is that it treats the run as either:
- a total success
- or a total failure
That is often too crude.
Many real tasks are multi-stage.
An agent may:
- gather the right information
- choose the right tools
- make good progress through the task
- and still fail in the final step
If you score only pass or fail, you lose the ability to see whether the system is improving.
That is why milestone-based evaluation matters.
Instead of judging only the ending, you define meaningful checkpoints inside the run.
For example, in a research workflow the milestones might be:
- Identify the right sources
- Extract the relevant facts
- Resolve conflicts or ambiguities
- Produce the final synthesis
In a customer-support workflow they might be:
- Retrieve the right account
- Verify policy constraints
- Prepare the right action
- Route to approval if needed
- Complete or draft the resolution
This does two useful things.
First, it tells you whether the agent is failing early or late.
Second, it gives partial credit for meaningful progress instead of flattening every failure into the same undifferentiated bucket.
That matters operationally.
An agent that consistently reaches the right milestones and fails at the final step is a different engineering problem from an agent that goes off track immediately.
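A minimal sketch of milestone scoring, assuming each milestone is a named boolean check over the run's trace (the checks and field names below are illustrative):

```python
from typing import Callable

def milestone_score(trace: dict,
                    milestones: list[tuple[str, Callable[[dict], bool]]]) -> dict:
    """Score a run by ordered milestones, stopping at the first miss.
    Partial credit plus the failure point separates 'failed at the
    final step' from 'went off track immediately'."""
    reached: list[str] = []
    for name, check in milestones:
        if not check(trace):
            return {"reached": reached, "failed_at": name,
                    "credit": len(reached) / len(milestones)}
        reached.append(name)
    return {"reached": reached, "failed_at": None, "credit": 1.0}

# Illustrative checks for the support workflow above.
support_milestones = [
    ("retrieved_right_account", lambda t: t.get("account_id") == t.get("expected_account_id")),
    ("verified_policy",         lambda t: "policy_check" in t.get("steps", [])),
    ("prepared_right_action",   lambda t: t.get("draft_action") is not None),
    ("routed_or_completed",     lambda t: t.get("status") in {"pending_approval", "resolved"}),
]
```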
Milestones also work well with the S.T.E.P. lens.
At each checkpoint, you can ask:
- was the sequence still sensible?
- was the tool use still precise?
- was the recovery behavior appropriate?
- did the run remain in policy?
So milestone scoring is not a replacement for trajectory evaluation.
It is one of the clearest ways to make trajectory evaluation more informative.
What Data You Actually Need to Evaluate a Trajectory
One reason teams stay vague on trajectory evaluation is that they never define the minimum evidence needed to do it.
You do not need a perfect observability platform before you can start.
But you do need more than the final answer.
At minimum, a usable trajectory record should preserve:
- the user request or task objective
- the intermediate steps or decisions
- tool calls and tool outputs
- retry attempts
- the evolving task state or relevant retrieved context
- the final result
- any approvals, guardrail triggers, or escalation events
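That minimum can be pinned down as a schema so every run is captured the same way.
This is one possible shape, sketched with Python type hints; the field names are illustrative, not a standard format.

```python
from typing import Any, TypedDict

class ToolCall(TypedDict):
    tool: str               # which tool was called
    args: dict[str, Any]    # the arguments it was called with
    output: Any             # what the tool returned
    ok: bool                # whether the call succeeded

class TrajectoryRecord(TypedDict):
    objective: str                # the user request or task objective
    steps: list[str]              # intermediate steps or decisions, in order
    tool_calls: list[ToolCall]    # calls, outputs, and success flags
    retries: int                  # retry attempts across the run
    state_snapshots: list[dict]   # evolving task state or retrieved context
    final_result: Any             # what the run ultimately produced
    control_events: list[str]     # approvals, guardrail triggers, escalations
```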
Without that structure, you cannot really tell:
- whether the path was coherent
- whether the wrong tool was chosen
- whether the retry logic made sense
- whether the agent drifted from the current state of the world
- whether the system stayed inside policy
This is where instrumentation and evaluation meet.
Tracing gives you the raw material.
Trajectory evaluation gives you the judgment layer.
If you do not capture structured run data, then trajectory evaluation becomes mostly storytelling after the fact.
And if you capture everything but never judge it, then you have observability without real evaluation.
What Good Agent Evaluation Looks Like in Practice
The public literature is converging on the same broad lesson even when the terminology differs.
Anthropic, AWS, Google Cloud, Databricks, LangSmith-style trajectory evals, and newer step-quality work like AgentProcessBench all point in the same direction:
- final-answer scoring still matters
- intermediate behavior matters too
- tool choice and parameter quality matter
- retries and recovery matter
- policy compliance matters
- consistency across runs matters
That means a serious evaluation stack usually combines multiple layers:
- Final outcome checks
- Trajectory or step-quality checks
- Tool-use correctness checks
- Recovery and robustness checks
- Policy and guardrail checks
In practice, those often map to eval forms like this:
| Eval form | What it checks |
|---|---|
| Final outcome checks | whether the final answer or completed task was correct |
| Trajectory checks | whether the sequence of steps matched a sensible path |
| Tool-call checks | whether the right tools, arguments, and interpretations were used |
| Policy assertions | whether the run stayed inside permissions, approvals, and guardrails |
Some teams will score these with custom datasets and judges.
Some will use trajectory evaluators in platforms like LangSmith.
Some will start with narrower internal rubrics around their most expensive or risky workflows.
Many mature systems end up with a stack that looks something like this:
- Deterministic assertions for objective constraints
- Milestone scoring for progress through the workflow
- Judge-based scoring for sequence quality and reasoning quality
- Outcome scoring for final task success
- Long-run reliability checks across repeated executions
That last layer matters because one apparently good run proves much less than teams think.
A trajectory evaluation program should help you answer not just:
Did this run work?
but also:
Does this system usually behave well under realistic variation?
The tooling can vary.
The principle should not.
You need to know whether the run was trustworthy, not just whether the ending looked good.
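Stitched together, the layers can feed one aggregate verdict per run.
The sketch below reuses the earlier sketches (`assert_trajectory`, `milestone_score`), treats hard assertions as gates and soft scores as signals, and its 0.7 threshold is an arbitrary assumption.

```python
def evaluate_run(trace: dict, events: list[dict],
                 milestones: list, judge_scores: list[float],
                 allowed_tools: set[str], retry_budget: int) -> dict:
    """One aggregate record per run: assertions gate, scores inform."""
    violations = assert_trajectory(events, allowed_tools, retry_budget)
    progress = milestone_score(trace, milestones)
    avg_judge = sum(judge_scores) / len(judge_scores) if judge_scores else 0.0
    return {
        "violations": violations,                 # deterministic checks
        "milestone_credit": progress["credit"],   # progress through the workflow
        "judge_score": avg_judge,                 # sequence and reasoning quality
        "trustworthy": (not violations
                        and progress["credit"] == 1.0
                        and avg_judge >= 0.7),    # threshold is an assumption
    }
```

Run this across many executions and the long-run reliability question becomes answerable: not whether one run passed, but how often runs pass under realistic variation.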
The Failure Modes of Shallow Agent Evals
Most weak agent-evaluation programs fail in predictable ways.
They do one or more of these:
- score only the final answer
- confuse logging with evaluation
- reward verbose reasoning instead of useful reasoning
- ignore tool misuse when the answer happens to be correct
- ignore recovery behavior completely
- ignore policy fit because guardrails blocked the worst outcome
Many public articles make the same mistake in softer form.
They say "evaluate the trajectory" but never explain what should actually be judged inside the run.
That is why this article anchors on a concrete inspection model.
Without one, trajectory evaluation stays abstract.
FAQ
Why are normal output evals not enough for agent systems?
Because agents act over multiple steps.
A final answer can be correct even when the run that produced it was wasteful, brittle, unsafe, or out of policy.
What should I evaluate inside an agent run?
Start with the S.T.E.P. dimensions:
Sequence, Tooling, Error Recovery, and Policy Fit.
That gives you a practical way to inspect the run beyond the ending.
Is trajectory evaluation only relevant for multi-agent systems?
No.
It matters anywhere an agent acts over time.
A single-agent ReAct loop can still produce a bad trajectory.
Multi-agent systems just make the need more obvious because there are more handoffs and more places to fail.
When is full trajectory evaluation worth the effort?
It is most worth it when the agent can act over time, call tools, create side effects, or appear successful after a fragile run.
If the system is mostly a single bounded response, output evaluation may still be enough.
If the system can wander, retry, recover, or cross execution boundaries, the path itself deserves evaluation.
Do I need a golden trajectory or perfect reference path?
No.
Sometimes a reference trajectory is useful, especially when the workflow is stable and certain steps are mandatory.
But many real agent tasks allow several good paths.
In those cases, milestone scoring, programmatic assertions, and judge-based evaluation are often better than pretending there is only one correct route.
Do I need an LLM-as-a-judge to do trajectory evaluation?
Not always.
Use deterministic assertions where the rule is objective.
Use human review when the stakes are high or the rubric is still being formed.
Use LLM-as-a-judge when the path is flexible and you need scalable scoring, but only with a clear rubric and periodic human calibration.
Is trajectory evaluation the same as tracing?
No.
Tracing captures evidence about the run.
Trajectory evaluation uses that evidence to judge whether the run was good.
Can a bad trajectory still produce a good answer?
Yes.
That is exactly why this topic matters.
A lucky ending can hide a run that would not survive slightly different conditions in production.
Final Thought
For agent systems, evaluation is no longer only about correctness.
It is about trustworthiness.
The final answer still matters.
But once the system can plan, act, retry, recover, and cross execution boundaries, the run itself becomes part of what you are shipping.
That means the path has to be good enough to trust, not just lucky enough to finish.