Traces as Test Data: Using Production Runs to Improve Agent Quality

Production traces are not just for debugging. The best ones become future quality protection: regression fixtures, scenario cases, and stronger offline evals. The trick is knowing which traces deserve promotion.

Production teaches faster than curated test sets do.

That is one of the big lessons of live agent systems.

Once an agent is operating against real users, real tools, real retrieval surfaces, and real approval pressure, the system starts revealing things the offline suite did not fully anticipate.

That is useful.

It is also easy to mishandle.

Some teams make the mistake of learning nothing from production.

Others make the opposite mistake and treat every strange live run like it should immediately become part of the permanent test set.

Both are weak operating habits.

The better discipline is selective promotion: treat each candidate trace as a deliberate promotion decision, not an automatic import.

This article follows Online Evals vs Offline Evals, Tracing and Observability for Agent Systems, Regression Testing for Agents, and Drift, Degradation, and Slow Failure in Long-Lived Agent Systems. Those pieces explain the live-vs-offline split, the evidence layer, the release gate, and the signs of weakening trust. This one explains how selected production traces should become future protection instead of staying only as post-hoc evidence.

Traces Are Evidence. Test Data Is Future Protection.

This distinction has to stay explicit.

A trace is the record of what happened in one run.

It helps a team:

  - debug one specific incident
  - explain why the agent behaved the way it did
  - audit what actions were actually taken

But not every trace should become part of the offline suite.

Test data is different.

Test data exists to influence future decisions:

  - regression fixtures that gate releases
  - scenario cases that probe known weaknesses
  - offline evals that judge candidate changes

So the move from trace to test data is not automatic.

It is a promotion decision.

That matters because production traces are messy.

They can contain:

  - private or sensitive user content
  - one-off noise that will never recur
  - duplicates of lessons the suite already teaches
  - behavior the team cannot yet explain

If a team promotes everything, the suite gets noisy and bloated.

If it promotes nothing, production keeps teaching the same lesson expensively.

The real job is selective promotion.

Why Teams Get This Wrong

There are a few recurring mistakes.

Promoting Nothing

The team looks at live failures, fixes the immediate issue, and moves on.

That leaves production doing the same teaching work over and over again.

The system gets patched.

It does not get much more protected.

Promoting Everything

This is the opposite failure.

Every odd run becomes:

  - a new regression fixture
  - a new scenario case
  - a new eval check

That sounds rigorous.

Usually it just creates a dirty suite full of:

  - one-off noise
  - near-duplicate cases
  - fixtures nobody can explain or maintain

Leaving Incidents as Anecdotes

Some teams know a run was important but never convert it into structured future protection.

It stays:

  - a war story people retell
  - a note in an incident doc
  - knowledge in one engineer's head

That is not enough.

If the case matters, it should eventually change the offline layer.

The P.R.O.M.O.T.E. Filter

A useful way to decide whether a trace deserves promotion is the P.R.O.M.O.T.E. filter:

  - Patterned
  - Risk-Bearing
  - Operationally Costly
  - Missed Offline
  - Observable Enough
  - Teachable
  - Enduring

This is not a vendor workflow.

It is a practical filter for deciding whether a production trace should become durable offline quality protection.

Patterned

Is this trace part of a real pattern rather than one weird run?

One isolated oddity is not always worth promotion.

But repeated weak behavior often is.

Look for:

  - the same weak step appearing across multiple runs
  - recurrence across different users or inputs
  - a cluster of similar traces rather than one isolated oddity

Risk-Bearing

Does this trace matter enough to protect against in the future?

That can mean:

  - user-visible harm if the pattern recurs
  - risky or hard-to-reverse actions
  - erosion of trust in the system's judgment

If the pattern is harmless and low-impact, it may not deserve permanent suite space.

Operationally Costly

Some traces matter not because they are catastrophic, but because they are expensive or exhausting.

Examples:

  - runs that burn far more tool calls or tokens than they should
  - loops that eventually recover but waste time doing it
  - cases that keep pulling humans in to review or correct

These are exactly the cases teams often under-protect because the final answer still looked acceptable.

Missed Offline

Did the existing offline layer fail to represent this behavior well enough?

This is a crucial test.

If the current suite already covers the pattern and still failed to catch it, the issue may be:

  - a weak or miscalibrated check
  - a fixture that no longer matches real conditions
  - a gate that is too loose to fail

If the suite does not cover it at all, then the trace is a stronger candidate for promotion.

Observable Enough

Can the team actually explain what happened well enough to turn this into a reusable case?

If the trace is too incomplete, too ambiguous, or too messy to teach anything durable, it may remain useful as observability evidence without being good test data.

Promotion should not reward confusion.

Teachable

Can this trace become a cleaner future protection artifact?

That might mean:

  - distilling the trace into a minimal scenario case
  - writing an explicit check for the weak behavior
  - reconstructing a sanitized version of the triggering input

If the trace cannot be translated into something the offline layer can actually use, it may not be ready for promotion yet.

Enduring

Will this pattern still matter after the immediate incident cools down?

Some traces are too tied to one temporary condition.

Others reveal a stable class of weakness that the team should protect against long term.

Those are the ones that deserve permanent suite real estate.
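The seven criteria above can be sketched as one conservative check. This is a minimal illustration only: the field names are assumptions, not a real trace schema, and the thresholds are arbitrary.

```python
from dataclasses import dataclass

# Hypothetical summary a reviewer fills in by hand after reading a trace.
# Every field name here is illustrative, not part of any real format.
@dataclass
class TraceReview:
    occurrences: int            # similar traces observed (Patterned)
    risk_bearing: bool          # could recurrence cause real harm? (Risk-Bearing)
    operationally_costly: bool  # expensive even when the answer looked fine
    covered_offline: bool       # does the existing suite already represent it?
    explainable: bool           # can the team say what happened? (Observable Enough)
    reducible_to_case: bool     # can it become a clean fixture? (Teachable)
    likely_to_recur: bool       # will the pattern outlive the incident? (Enduring)

def should_promote(r: TraceReview) -> bool:
    """Apply the P.R.O.M.O.T.E. filter as a conservative AND of its criteria."""
    patterned = r.occurrences >= 2            # one weird run is not a pattern
    matters = r.risk_bearing or r.operationally_costly
    missed_offline = not r.covered_offline    # already-covered gaps need a fix, not a case
    return all([patterned, matters, missed_offline,
                r.explainable, r.reducible_to_case, r.likely_to_recur])
```

The AND is deliberate: a trace that fails any single criterion stays observability evidence rather than suite content.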

What Traces Are Worth Promoting

The most valuable traces usually come from a few buckets.

1. New Failure Classes

If a production trace reveals a class of failure the team had not represented offline, that is a strong candidate.

For example:

  - a tool failure mode the suite never represented
  - an input shape no offline case exercises
  - a retrieval gap that only real usage exposed

2. Repeated Weak Trajectories

This is where Evaluating Agent Trajectories, Not Just Outputs matters again.

If live traces keep showing:

  - redundant or repeated tool calls
  - flailing retry loops
  - weak recovery after errors

then the problem is not only a live issue.

It is an offline protection gap.
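One way to make "repeated weak trajectories" concrete is to count redundant tool calls across a batch of traces. A minimal sketch, assuming traces have been reduced to lists of `(tool_name, args)` steps, which is a hypothetical representation rather than any real trace format:

```python
from collections import Counter

def redundant_tool_calls(steps):
    """Count tool calls repeated with identical arguments within one trace.

    `steps` is an illustrative list of (tool_name, args) tuples; real
    trace schemas will differ.
    """
    counts = Counter(steps)
    return sum(n - 1 for n in counts.values() if n > 1)

def flag_weak_trajectories(traces, threshold=2):
    """Return the indices of traces whose redundancy crosses a threshold."""
    return [i for i, t in enumerate(traces)
            if redundant_tool_calls(t) >= threshold]
```

If the same trace indices keep getting flagged week after week, that is the pattern evidence the Patterned criterion asks for.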

3. Boundary-Stress Cases

Some traces matter because they expose pressure on the control system.

That includes:

  - runs that press against approval requirements
  - tool calls near the edge of granted permissions
  - behavior that skirts boundaries the team set deliberately

These often deserve promotion because they reveal where the system is pushing against the boundary the team actually cares about.

4. Slow-Failure Signals

This is where the previous article connects directly.

If live traces reveal:

  - trajectories that are slowly getting longer
  - correction and retry rates that are quietly rising
  - small quality declines no single run explains

then the traces may deserve promotion not because of one headline failure, but because they reveal a quieter degrading pattern the offline layer should start protecting against.
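A crude way to watch for this kind of quiet degradation is a rolling average over per-run quality scores. A hedged sketch, assuming scores have already been extracted from traces by whatever judgment the team trusts:

```python
def rolling_mean(values, window):
    """Simple rolling mean over a list of per-run scores."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def sustained_decline(scores, window=5, drop=0.1):
    """Flag when the latest windowed average has fallen `drop` below the first.

    A deliberately crude slow-failure detector: no single run fails,
    but the rolling average drifts downward over time.
    """
    means = rolling_mean(scores, window)
    if len(means) < 2:
        return False
    return means[0] - means[-1] >= drop
```

The exact window and drop are tuning choices; the point is that the signal lives in the aggregate, not in any one trace.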

What Should Not Be Promoted

This section matters as much as the previous one.

One-Off Noise

A trace can be weird without being durable.

If the team cannot show a real pattern, promotion is usually premature.

Privacy-Sensitive Raw Content

The point is not to dump raw production interactions into the suite indiscriminately.

Sometimes the right move is:

  - reconstructing a sanitized version of the case
  - synthesizing an equivalent input that preserves the lesson
  - promoting the pattern without promoting the raw content

The lesson may deserve promotion even when the raw trace does not.

Duplicates

If five traces all teach the same lesson, the suite probably does not need all five.

Promotion should increase protection, not just volume.

Unexplainable Curiosities

If the team still does not understand what happened, the trace may belong in review and debugging before it belongs in the permanent offline layer.

What Promotion Changes in Practice

A promoted trace should change the offline system in a concrete way.

That usually means one or more of:

  - a new regression fixture
  - a new or expanded scenario case
  - a stronger or newly added eval check
  - a sharper release gate

For example, imagine a coding agent trace that shows a new weak recovery pattern: after a failed test run, the agent keeps reapplying the same patch instead of rereading the error.

That trace might become:

  - a sanitized regression fixture built from the triggering task
  - a trajectory check that flags repeated identical retries
  - a scenario case the offline suite runs on every candidate build

That is the shift the article is arguing for.

The value of the trace is not only that it helped explain yesterday’s problem.

The value is that it can protect tomorrow’s release.
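As an illustration of that shift, here is a hypothetical sketch of distilling a promoted trace into a fixture plus a trajectory check. The dict layout and the "identical retries" expectation are assumptions for the example, not a real fixture format:

```python
def make_fixture(trace, max_identical_retries=1):
    """Distill a promoted trace into an illustrative regression fixture.

    Keeps only what the offline layer needs: the (already sanitized)
    triggering input and a trajectory expectation, not raw production content.
    """
    return {
        "input": trace["sanitized_input"],
        "expectation": {"max_identical_retries": max_identical_retries},
    }

def check_fixture(fixture, new_trace_steps):
    """Fail a candidate build that repeats the promoted weak pattern.

    Measures the longest run of identical consecutive steps; runs beyond
    the allowed retry count mean the weak recovery pattern is back.
    """
    limit = fixture["expectation"]["max_identical_retries"]
    runs, longest = 1, 1
    for prev, cur in zip(new_trace_steps, new_trace_steps[1:]):
        runs = runs + 1 if cur == prev else 1
        longest = max(longest, runs)
    return longest - 1 <= limit
```

Run against each candidate release, the fixture turns yesterday's incident into a standing claim about tomorrow's behavior.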

How This Fits the Evaluation Loop

The full loop now becomes cleaner:

  1. Tracing and Observability for Agent Systems captures the evidence.
  2. Online Evals vs Offline Evals explains where live judgment and controlled judgment belong.
  3. This article explains how selected live traces should feed the offline layer.
  4. Regression Testing for Agents is where those promoted cases become future release protection.

That is how the quality system compounds.

Without the promotion step, teams often have:

  - rich traces
  - clear incident explanations
  - well-written postmortems

but not enough future protection.

A Practical Starting Point for Small Teams

Small teams do not need a giant trace-ingestion pipeline to do this well.

A good starting loop is:

  1. Once a week, review the handful of live traces that stood out.
  2. Run each through the P.R.O.M.O.T.E. filter.
  3. Promote at most one or two, sanitized, into the offline suite.
  4. Retire or merge any promoted case that stops earning its place.

That is enough to move from production repeatedly teaching the same expensive lesson to production steadily strengthening the offline suite.
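That loop can be sketched as a small triage pass. The weekly cap and the filter callback are illustrative choices, not prescriptions:

```python
def weekly_triage(candidates, passes_filter, cap=2):
    """One pass of a small team's promotion loop.

    `candidates` is a list of (trace_id, review) pairs flagged during the
    week; `passes_filter` is the team's promotion check (for example, the
    P.R.O.M.O.T.E. criteria). Promoting at most `cap` per week keeps the
    suite growing deliberately rather than by volume.
    """
    promoted, deferred = [], []
    for trace_id, review in candidates:
        if passes_filter(review) and len(promoted) < cap:
            promoted.append(trace_id)
        else:
            deferred.append(trace_id)
    return promoted, deferred
```

Deferred traces are not lost: they stay in the observability layer and can be reconsidered once they show a pattern.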

Traces Matter Most When They Change the Future

The deeper point is simple.

A trace is useful when it explains what happened.

A promoted trace is more valuable because it helps stop the same class of problem from surviving into the future.

That is the real quality-compounding move.

Not: we saw it, we fixed it, we moved on.

But: we saw it, we understood it, and we made sure every future release gets tested against it.

That is how production evidence becomes future quality protection.

FAQ

Should every production failure become a test?

No.

Only traces that reveal a durable, important, explainable pattern should usually be promoted into the offline layer.

What should stay as observability evidence only?

One-off noise, privacy-sensitive raw cases, duplicates, and traces the team cannot yet explain well enough to turn into reusable protection.

How does this differ from online evals?

Online evals judge live behavior now.

Trace promotion decides which live behaviors should shape future offline tests and release gates.

How does this connect to slow failure?

Repeated weak traces are often one of the clearest early signs of slow failure.

If those patterns keep appearing, promotion helps turn them into future protection instead of repeated surprise.

What comes next after this?

The strongest next step is to deepen the regression layer, where promoted cases become durable release gates.

This article closes the gap between seeing live evidence and institutionalizing that evidence in the offline quality system.