Traces as Test Data: Using Production Runs to Improve Agent Quality

Production traces are not just for debugging. The best ones become future quality protection: regression fixtures, scenario cases, and stronger offline evals. The trick is knowing which traces deserve promotion.

Production teaches faster than curated test sets do.

That is one of the big lessons of live agent systems.

Once an agent is operating against real users, real tools, real retrieval surfaces, and real approval pressure, the system starts revealing things the offline suite did not fully anticipate.

That is useful.

It is also easy to mishandle.

Some teams make the mistake of learning nothing from production.

Others make the opposite mistake and treat every strange live run like it should immediately become part of the permanent test set.

Both are weak operating habits.

The better discipline is selective promotion: treat each candidate trace as a deliberate promotion decision, not an automatic import.

This article follows Online Evals vs Offline Evals, Tracing and Observability for Agent Systems, Regression Testing for Agents, and Drift, Degradation, and Slow Failure in Long-Lived Agent Systems. Those pieces explain the live-vs-offline split, the evidence layer, the release gate, and the signs of weakening trust. This one explains how selected production traces should become future protection instead of staying only as post-hoc evidence.

Traces Are Evidence. Test Data Is Future Protection.

This distinction has to stay explicit.

A trace is the record of what happened in one run.

It helps a team:

  - debug one specific incident
  - explain why the agent behaved the way it did
  - audit what actions were actually taken

But not every trace should become part of the offline suite.

Test data is different.

Test data exists to influence future decisions:

  - regression fixtures that gate releases
  - scenario cases that probe known weaknesses
  - offline evals that judge candidate changes

So the move from trace to test data is not automatic.

It is a promotion decision.

That matters because production traces are messy.

They can contain:

  - private or sensitive user content
  - one-off noise that will never recur
  - duplicates of lessons the suite already teaches
  - behavior the team cannot yet explain

If a team promotes everything, the suite gets noisy and bloated.

If it promotes nothing, production keeps teaching the same lesson expensively.

The real job is selective promotion.

Why Teams Get This Wrong

There are a few recurring mistakes.

Promoting Nothing

The team looks at live failures, fixes the immediate issue, and moves on.

That leaves production doing the same teaching work over and over again.

The system gets patched.

It does not get much more protected.

Promoting Everything

This is the opposite failure.

Every odd run becomes:

  - a new regression fixture
  - a new scenario case
  - a new eval check

That sounds rigorous.

Usually it just creates a dirty suite full of:

  - one-off noise
  - near-duplicate cases
  - fixtures nobody can explain or maintain

Leaving Incidents as Anecdotes

Some teams know a run was important but never convert it into structured future protection.

It stays:

  - a war story people retell
  - a note in an incident doc
  - knowledge in one engineer's head

That is not enough.

If the case matters, it should eventually change the offline layer.

The P.R.O.M.O.T.E. Filter

A useful way to decide whether a trace deserves promotion is the P.R.O.M.O.T.E. filter:

  - Patterned
  - Risk-Bearing
  - Operationally Costly
  - Missed Offline
  - Observable Enough
  - Teachable
  - Enduring

This is not a vendor workflow.

It is a practical filter for deciding whether a production trace should become durable offline quality protection.

Patterned

Is this trace part of a real pattern rather than one weird run?

One isolated oddity is not always worth promotion.

But repeated weak behavior often is.

Look for:

  - the same weak step appearing across multiple runs
  - recurrence across different users or inputs
  - a cluster of similar traces rather than one isolated oddity

Risk-Bearing

Does this trace matter enough to protect against in the future?

That can mean:

  - user-visible harm if the pattern recurs
  - risky or hard-to-reverse actions
  - erosion of trust in the system's judgment

If the pattern is harmless and low-impact, it may not deserve permanent suite space.

Operationally Costly

Some traces matter not because they are catastrophic, but because they are expensive or exhausting.

Examples:

  - runs that burn far more tool calls or tokens than they should
  - loops that eventually recover but waste time doing it
  - cases that keep pulling humans in to review or correct

These are exactly the cases teams often under-protect because the final answer still looked acceptable.

Missed Offline

Did the existing offline layer fail to represent this behavior well enough?

This is a crucial test.

If the current suite already covers the pattern and still failed to catch it, the issue may be:

  - a weak or miscalibrated check
  - a fixture that no longer matches real conditions
  - a gate that is too loose to fail

If the suite does not cover it at all, then the trace is a stronger candidate for promotion.

Observable Enough

Can the team actually explain what happened well enough to turn this into a reusable case?

If the trace is too incomplete, too ambiguous, or too messy to teach anything durable, it may remain useful as observability evidence without being good test data.

Promotion should not reward confusion.

Teachable

Can this trace become a cleaner future protection artifact?

That might mean:

  - distilling the trace into a minimal scenario case
  - writing an explicit check for the weak behavior
  - reconstructing a sanitized version of the triggering input

If the trace cannot be translated into something the offline layer can actually use, it may not be ready for promotion yet.

Enduring

Will this pattern still matter after the immediate incident cools down?

Some traces are too tied to one temporary condition.

Others reveal a stable class of weakness that the team should protect against long term.

Those are the ones that deserve permanent suite real estate.
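The seven criteria above can be sketched as one conservative check. This is a minimal illustration only: the field names are assumptions, not a real trace schema, and the thresholds are arbitrary.

```python
from dataclasses import dataclass

# Hypothetical summary a reviewer fills in by hand after reading a trace.
# Every field name here is illustrative, not part of any real format.
@dataclass
class TraceReview:
    occurrences: int            # similar traces observed (Patterned)
    risk_bearing: bool          # could recurrence cause real harm? (Risk-Bearing)
    operationally_costly: bool  # expensive even when the answer looked fine
    covered_offline: bool       # does the existing suite already represent it?
    explainable: bool           # can the team say what happened? (Observable Enough)
    reducible_to_case: bool     # can it become a clean fixture? (Teachable)
    likely_to_recur: bool       # will the pattern outlive the incident? (Enduring)

def should_promote(r: TraceReview) -> bool:
    """Apply the P.R.O.M.O.T.E. filter as a conservative AND of its criteria."""
    patterned = r.occurrences >= 2            # one weird run is not a pattern
    matters = r.risk_bearing or r.operationally_costly
    missed_offline = not r.covered_offline    # already-covered gaps need a fix, not a case
    return all([patterned, matters, missed_offline,
                r.explainable, r.reducible_to_case, r.likely_to_recur])
```

The AND is deliberate: a trace that fails any single criterion stays observability evidence rather than suite content.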

What Traces Are Worth Promoting

The most valuable traces usually come from a few buckets.

1. New Failure Classes

If a production trace reveals a class of failure the team had not represented offline, that is a strong candidate.

For example:

  - a tool failure mode the suite never represented
  - an input shape no offline case exercises
  - a retrieval gap that only real usage exposed

2. Repeated Weak Trajectories

This is where Evaluating Agent Trajectories, Not Just Outputs matters again.

If live traces keep showing:

  - redundant or repeated tool calls
  - flailing retry loops
  - weak recovery after errors

then the problem is not only a live issue.

It is an offline protection gap.
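One way to make "repeated weak trajectories" concrete is to count redundant tool calls across a batch of traces. A minimal sketch, assuming traces have been reduced to lists of `(tool_name, args)` steps, which is a hypothetical representation rather than any real trace format:

```python
from collections import Counter

def redundant_tool_calls(steps):
    """Count tool calls repeated with identical arguments within one trace.

    `steps` is an illustrative list of (tool_name, args) tuples; real
    trace schemas will differ.
    """
    counts = Counter(steps)
    return sum(n - 1 for n in counts.values() if n > 1)

def flag_weak_trajectories(traces, threshold=2):
    """Return the indices of traces whose redundancy crosses a threshold."""
    return [i for i, t in enumerate(traces)
            if redundant_tool_calls(t) >= threshold]
```

If the same trace indices keep getting flagged week after week, that is the pattern evidence the Patterned criterion asks for.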

3. Boundary-Stress Cases

Some traces matter because they expose pressure on the control system.

That includes:

  - runs that press against approval requirements
  - tool calls near the edge of granted permissions
  - behavior that skirts boundaries the team set deliberately

These often deserve promotion because they reveal where the system is pushing against the boundary the team actually cares about.

4. Slow-Failure Signals

This is where the previous article connects directly.

If live traces reveal:

  - trajectories that are slowly getting longer
  - correction and retry rates that are quietly rising
  - small quality declines no single run explains

then the traces may deserve promotion not because of one headline failure, but because they reveal a quieter degrading pattern the offline layer should start protecting against.
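A crude way to watch for this kind of quiet degradation is a rolling average over per-run quality scores. A hedged sketch, assuming scores have already been extracted from traces by whatever judgment the team trusts:

```python
def rolling_mean(values, window):
    """Simple rolling mean over a list of per-run scores."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def sustained_decline(scores, window=5, drop=0.1):
    """Flag when the latest windowed average has fallen `drop` below the first.

    A deliberately crude slow-failure detector: no single run fails,
    but the rolling average drifts downward over time.
    """
    means = rolling_mean(scores, window)
    if len(means) < 2:
        return False
    return means[0] - means[-1] >= drop
```

The exact window and drop are tuning choices; the point is that the signal lives in the aggregate, not in any one trace.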

What Should Not Be Promoted

This section matters as much as the previous one.

One-Off Noise

A trace can be weird without being durable.

If the team cannot show a real pattern, promotion is usually premature.

Privacy-Sensitive Raw Content

The point is not to dump raw production interactions into the suite indiscriminately.

Sometimes the right move is:

  - reconstructing a sanitized version of the case
  - synthesizing an equivalent input that preserves the lesson
  - promoting the pattern without promoting the raw content

The lesson may deserve promotion even when the raw trace does not.

Duplicates

If five traces all teach the same lesson, the suite probably does not need all five.

Promotion should increase protection, not just volume.

Unexplainable Curiosities

If the team still does not understand what happened, the trace may belong in review and debugging before it belongs in the permanent offline layer.

What Promotion Changes in Practice

A promoted trace should change the offline system in a concrete way.

That usually means one or more of:

  - a new regression fixture
  - a new or expanded scenario case
  - a stronger or newly added eval check
  - a sharper release gate

For example, imagine a coding agent trace that shows a new weak recovery pattern: after a failed test run, the agent keeps reapplying the same patch instead of rereading the error.

That trace might become:

  - a sanitized regression fixture built from the triggering task
  - a trajectory check that flags repeated identical retries
  - a scenario case the offline suite runs on every candidate build

That is the shift the article is arguing for.

The value of the trace is not only that it helped explain yesterday’s problem.

The value is that it can protect tomorrow’s release.
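As an illustration of that shift, here is a hypothetical sketch of distilling a promoted trace into a fixture plus a trajectory check. The dict layout and the "identical retries" expectation are assumptions for the example, not a real fixture format:

```python
def make_fixture(trace, max_identical_retries=1):
    """Distill a promoted trace into an illustrative regression fixture.

    Keeps only what the offline layer needs: the (already sanitized)
    triggering input and a trajectory expectation, not raw production content.
    """
    return {
        "input": trace["sanitized_input"],
        "expectation": {"max_identical_retries": max_identical_retries},
    }

def check_fixture(fixture, new_trace_steps):
    """Fail a candidate build that repeats the promoted weak pattern.

    Measures the longest run of identical consecutive steps; runs beyond
    the allowed retry count mean the weak recovery pattern is back.
    """
    limit = fixture["expectation"]["max_identical_retries"]
    runs, longest = 1, 1
    for prev, cur in zip(new_trace_steps, new_trace_steps[1:]):
        runs = runs + 1 if cur == prev else 1
        longest = max(longest, runs)
    return longest - 1 <= limit
```

Run against each candidate release, the fixture turns yesterday's incident into a standing claim about tomorrow's behavior.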

How This Fits the Evaluation Loop

The full loop now becomes cleaner:

  1. Tracing and Observability for Agent Systems captures the evidence.
  2. Online Evals vs Offline Evals explains where live judgment and controlled judgment belong.
  3. This article explains how selected live traces should feed the offline layer.
  4. Regression Testing for Agents is where those promoted cases become future release protection.

That is how the quality system compounds.

Without the promotion step, teams often have:

  - rich traces
  - clear incident explanations
  - well-written postmortems

but not enough future protection.

A Practical Starting Point for Small Teams

Small teams do not need a giant trace-ingestion pipeline to do this well.

A good starting loop is:

  1. Once a week, review the handful of live traces that stood out.
  2. Run each through the P.R.O.M.O.T.E. filter.
  3. Promote at most one or two, sanitized, into the offline suite.
  4. Retire or merge any promoted case that stops earning its place.

That is enough to move from production repeatedly teaching the same expensive lesson to production steadily strengthening the offline suite.
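That loop can be sketched as a small triage pass. The weekly cap and the filter callback are illustrative choices, not prescriptions:

```python
def weekly_triage(candidates, passes_filter, cap=2):
    """One pass of a small team's promotion loop.

    `candidates` is a list of (trace_id, review) pairs flagged during the
    week; `passes_filter` is the team's promotion check (for example, the
    P.R.O.M.O.T.E. criteria). Promoting at most `cap` per week keeps the
    suite growing deliberately rather than by volume.
    """
    promoted, deferred = [], []
    for trace_id, review in candidates:
        if passes_filter(review) and len(promoted) < cap:
            promoted.append(trace_id)
        else:
            deferred.append(trace_id)
    return promoted, deferred
```

Deferred traces are not lost: they stay in the observability layer and can be reconsidered once they show a pattern.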

Traces Matter Most When They Change the Future

The deeper point is simple.

A trace is useful when it explains what happened.

A promoted trace is more valuable because it helps stop the same class of problem from surviving into the future.

That is the real quality-compounding move.

Not: we saw it, we fixed it, we moved on.

But: we saw it, we understood it, and we made sure every future release gets tested against it.

That is how production evidence becomes future quality protection.

FAQ

Should every production failure become a test?

No.

Only traces that reveal a durable, important, explainable pattern should usually be promoted into the offline layer.

What should stay as observability evidence only?

One-off noise, privacy-sensitive raw cases, duplicates, and traces the team cannot yet explain well enough to turn into reusable protection.

How does this differ from online evals?

Online evals judge live behavior now.

Trace promotion decides which live behaviors should shape future offline tests and release gates.

How does this connect to slow failure?

Repeated weak traces are often one of the clearest early signs of slow failure.

If those patterns keep appearing, promotion helps turn them into future protection instead of repeated surprise.

What comes next after this?

The strongest next step is to deepen the regression layer, where promoted cases become durable release gates.

This article closes the gap between seeing live evidence and institutionalizing that evidence in the offline quality system.