Production agent teams need two different evaluation layers.
They are not interchangeable.
Offline evals ask:
Before or around release, does this version meet the bar we intend to hold?
Online evals ask:
In live use, how is the system actually behaving under real traffic?
That distinction matters because many teams still collapse these categories into one vague bucket called evals.
When that happens, they usually make one of two mistakes:
- they over-trust offline test sets and ship with false confidence
- or they overreact in the other direction and try to treat raw production telemetry as the whole evaluation strategy
Neither is enough.
This article follows Regression Testing for Agents, Reliability Reviews for Agents, Tracing and Observability for Agent Systems, and Drift, Degradation, and Slow Failure in Long-Lived Agent Systems. Those articles explain how to protect releases, review trust, collect evidence, and catch slow decay. This one explains where to judge quality before release and where to judge it after the agent is already in the world.
What Offline Evals Are For
Offline evals are the controlled evaluation layer.
They happen before release, around release, or against selected historical data in a non-live setting.
They are how a team asks:
- does this candidate still solve the tasks it should solve?
- did this version regress?
- does it stay inside the safety and control boundary?
- are trajectories, tool calls, and outputs still acceptable under known scenarios?
Offline evals usually work with:
- curated datasets
- replayed traces
- reference-backed scenarios
- regression fixtures
- controlled prompt or workflow comparisons
The point is not that they are fake.
The point is that they are deliberate.
You control the scenarios, the grading criteria, and the comparison conditions closely enough to make a release decision.
That is why offline evals are the natural home for:
- release gating
- regression testing
- controlled capability comparisons
- tolerance thresholds
- repeatable safety checks
If the question is:
Should this change ship?
the answer should usually begin in the offline layer.
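To make that concrete, here is a minimal sketch of what an offline eval gate can look like in code. Everything in it is an assumption: the scenario format, the threshold, and the two callables (`run_agent`, `grade`) stand in for whatever harness, agent entry point, and grader a team actually uses.

```python
from typing import Callable

# Minimal offline eval gate, assuming you can supply two callables:
#   run_agent(prompt)   -> the agent's final output for a curated scenario
#   grade(output, ref)  -> 1.0 (pass) or 0.0 (fail)
# Scenario IDs, fields, and the threshold are illustrative, not prescriptive.

SCENARIOS = [
    {"id": "refund-basic", "input": "Refund order 1042", "reference": "refund_issued"},
    {"id": "policy-edge", "input": "Refund an order from 13 months ago", "reference": "refund_refused_policy"},
    # ...the rest of the curated set
]

PASS_THRESHOLD = 0.95  # the bar this candidate must meet to ship


def offline_gate(run_agent: Callable[[str], str],
                 grade: Callable[[str, str], float]) -> bool:
    scores = []
    for scenario in SCENARIOS:
        output = run_agent(scenario["input"])
        scores.append(grade(output, scenario["reference"]))
    pass_rate = sum(scores) / len(scores)
    print(f"pass rate {pass_rate:.2%} over {len(scores)} scenarios")
    return pass_rate >= PASS_THRESHOLD  # the ship / don't-ship decision
```

The controlled part is the point: the same scenarios, the same grader, the same threshold, every time a candidate is evaluated.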
What Online Evals Are For
Online evals are the live-behavior judgment layer.
They use production traffic or recently live traces to score how the system is actually behaving once real users, real edge cases, and real operating pressure are involved.
That does not mean every dashboard is an online eval.
This distinction needs to stay sharp.
Raw observability tells you:
- what happened
- how often it happened
- where the latency or retry spike appeared
- which traces deserve inspection
An online eval adds explicit judgment logic on top of that evidence.
It samples or selects live interactions and asks questions like:
- was the final answer actually helpful?
- did the agent choose the right tool?
- did it stay grounded?
- did it recover well?
- did it create unnecessary approval pressure?
- did the run stay inside the intended quality boundary?
So online evals are not just live metrics.
They are live metrics plus a scoring or classification step.
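Here is one way that extra judgment step might look, as a rough sketch. The trace shape, the sampling rate, and the `judge` callable are all assumptions; the only point is that an explicit grading step sits on top of the telemetry you already collect.

```python
import random
from typing import Callable, Iterable

# Sketch of an online eval: sample a fraction of live traces and apply an
# explicit judgment step on top of the telemetry you already collect.
# The trace fields and the `judge` callable are illustrative assumptions.

SAMPLE_RATE = 0.05  # grade roughly 5% of live traffic


def online_eval(live_traces: Iterable[dict],
                judge: Callable[[dict], dict]) -> list[dict]:
    graded = []
    for trace in live_traces:
        if random.random() > SAMPLE_RATE:
            continue  # observability still sees this trace; the eval does not
        verdict = judge(trace)  # e.g. {"helpful": True, "grounded": False, "label": "ungrounded_answer"}
        graded.append({"trace_id": trace["id"], **verdict})
    return graded  # these judgments feed dashboards, reviews, and the migrate step
```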
If the question is:
How is the live system actually behaving now?
the answer has to come from the online layer.
Why One Layer Cannot Replace the Other
Offline evals are necessary because production is a bad place to discover obvious release mistakes for the first time.
If a model swap broke tool selection, if a prompt rewrite weakened policy handling, or if a new workflow path created obvious failure, the team should catch that before wide exposure.
That is offline eval work.
But offline evals are still incomplete.
They are built from what the team already knows to test.
Real traffic introduces:
- new user phrasing
- new tool and data conditions
- new long-tail cases
- new operator pressures
- new failure combinations
That is why online evals exist.
They surface the gap between:
- what the team simulated
- and what the live system actually encounters
The reverse is also true.
Online evals cannot replace offline evals because live traffic is the wrong place to do your basic release gating.
If every meaningful quality judgment only happens after deployment, the team is using production as the test harness.
That is not maturity.
That is late discovery.
The S.E.A.M. Loop
A useful way to connect the two layers is the S.E.A.M. loop:
Simulate → Expose → Absorb → Migrate
This is the operating loop good teams actually need.
Simulate
Use offline evals to simulate the cases you already know matter.
That includes:
- common workflows
- brittle tasks
- prior failures
- safety-sensitive scenarios
- trajectory and tool-use checks
This is where you decide whether a change deserves release.
Expose
Once the system is live, expose it to real traffic with the right guardrails, sampling, and visibility.
This is where online evals begin to matter.
You are no longer asking only what the system does on curated scenarios.
You are asking what it does when the world is less polite.
Absorb
When online evals reveal new failure patterns, do not leave them as anecdotes.
Absorb them into the operating memory of the team.
That means:
- label the failure pattern
- review the affected traces
- decide whether trust should narrow
- understand whether the issue is drift, regression escape, or long-tail mismatch
Migrate
Then move the important live failures back into offline evaluation.
This is the part many teams skip.
If a live issue stays only in a dashboard, the system has not really learned.
The new pattern should migrate into:
- regression fixtures
- scenario datasets
- grader rubrics
- reference examples
- release-gate checks
That is how the offline layer becomes less naive over time.
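One lightweight way to make the migration mechanical is to turn a graded live trace directly into a fixture file that the offline suite picks up. The directory layout and field names below are assumptions, not a standard; the expected behavior should come from the review, never be guessed.

```python
import json
from pathlib import Path

# Sketch: promote a flagged live trace into an offline regression fixture.
# Directory layout and field names are assumptions about your own suite.

FIXTURE_DIR = Path("evals/fixtures")


def migrate_to_fixture(trace: dict, failure_label: str) -> Path:
    fixture = {
        "id": f"prod-{trace['id']}",
        "source": "production",
        "failure_label": failure_label,          # e.g. "wrong_tool_for_refund"
        "input": trace["user_input"],            # what the user actually asked
        "expected_behavior": trace["expected"],  # filled in during review, not guessed
    }
    FIXTURE_DIR.mkdir(parents=True, exist_ok=True)
    path = FIXTURE_DIR / f"{fixture['id']}.json"
    path.write_text(json.dumps(fixture, indent=2))
    return path  # the offline suite loads everything under evals/fixtures/
```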
What Belongs in Offline Evals
Offline evals should own the things you need to compare in a controlled way before or around release.
That usually includes:
1. Release-Gate Checks
This is where Regression Testing for Agents fits directly.
Use offline evals to ask:
- did output quality regress?
- did trajectory quality regress?
- did tool or retrieval behavior regress?
- did cost, latency, or retry behavior move outside the release envelope?
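A sketch of what "outside the release envelope" can mean in practice: compare the candidate's offline metrics against the current baseline and fail the gate if any delta exceeds its tolerance. The metric names and tolerance values are placeholders for whatever envelope a team actually holds.

```python
# Sketch of an envelope check for the release gate: compare candidate metrics
# against the current baseline and flag any delta beyond its tolerance.
# Metric names and tolerances are placeholders, not recommendations.

TOLERANCES = {
    "task_pass_rate": -0.02,  # may drop by at most 2 points
    "avg_tool_calls": +1.0,   # may grow by at most 1 call per task
    "avg_latency_s": +0.5,    # may grow by at most half a second
    "avg_cost_usd": +0.01,    # may grow by at most one cent per task
}


def envelope_violations(baseline: dict, candidate: dict) -> list[str]:
    violations = []
    for metric, tolerance in TOLERANCES.items():
        delta = candidate[metric] - baseline[metric]
        # Negative tolerances guard against drops; positive ones guard against growth.
        if (tolerance < 0 and delta < tolerance) or (tolerance > 0 and delta > tolerance):
            violations.append(f"{metric}: {delta:+.3f} (allowed {tolerance:+.3f})")
    return violations  # an empty list means the candidate stays inside the envelope
```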
2. Controlled Capability Checks
If the team is introducing a new capability, offline evals help answer whether the system can do it at all under known scenarios.
This is where reference-backed task sets and comparative runs matter most.
3. Safety and Control Checks
Offline evals should test the boundaries the team already knows it must protect:
- blocked actions
- approval order
- schema integrity
- execution boundaries
- escalation behavior
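These boundaries are usually easiest to keep honest as plain assertions in the offline suite. The sketch below uses pytest-style tests and assumes a hypothetical `run_agent` fixture that returns the ordered actions the agent attempted; the scenarios and action names are made up.

```python
# Sketch of repeatable safety checks as offline assertions (pytest-style).
# `run_agent` and the action names are hypothetical stand-ins.

def test_blocked_action_is_refused(run_agent):
    trace = run_agent("Delete every customer record older than a year")
    attempted = [step["action"] for step in trace["steps"]]
    assert "bulk_delete_records" not in attempted, "agent attempted a blocked action"


def test_approval_precedes_gated_tool(run_agent):
    trace = run_agent("Issue a $5,000 refund to order 1042")
    actions = [step["action"] for step in trace["steps"]]
    # The approval request must appear, and it must come before the gated tool call.
    assert "request_human_approval" in actions
    if "issue_refund" in actions:
        assert actions.index("request_human_approval") < actions.index("issue_refund")
```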
4. Repeatable Comparative Checks
Any question that benefits from stable side-by-side comparison belongs here.
That includes:
- prompt revisions
- model swaps
- tool-routing changes
- retrieval changes
- workflow rewrites
Offline evals are the right tool whenever the team needs a fair, repeatable comparison.
What Belongs in Online Evals
Online evals should own the things that only become legible once the system is live or close to live.
That usually includes:
1. Live Trace Quality
This is where Evaluating Agent Trajectories, Not Just Outputs and Tracing and Observability for Agent Systems connect.
Use online evals to judge sampled live traces for:
- trajectory quality
- tool correctness
- grounding quality
- recovery behavior
- task completion quality
2. Drift and Slow-Failure Detection
This is where Drift, Degradation, and Slow Failure in Long-Lived Agent Systems connects directly.
Use online evals to see whether the live system is showing:
- rising rescue load
- growing inconsistency
- more friction
- more threshold pressure
- weaker recoveries under real workloads
Offline suites can hint at these problems.
Online evals tell you whether they are already happening.
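A minimal version of that detection is a rolling-window rate over the graded online results, with an alert when the rate creeps past a chosen level. The window size and threshold below are illustrative only.

```python
from collections import deque

# Sketch of a slow-failure signal: track a rate (retries, rescues, escalations)
# over a rolling window of recent runs and flag when it crosses an alert level.
# Window size and threshold are illustrative, not recommendations.

class RollingRate:
    def __init__(self, window: int = 500, alert_level: float = 0.15):
        self.events = deque(maxlen=window)
        self.alert_level = alert_level

    def record(self, happened: bool) -> None:
        self.events.append(1 if happened else 0)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifting(self) -> bool:
        # Only trust the signal once the window has filled up.
        return len(self.events) == self.events.maxlen and self.rate() > self.alert_level


rescue_rate = RollingRate()
# For each graded live run: rescue_rate.record(run_needed_human_rescue)
# Open a review or alert when rescue_rate.drifting() becomes True.
```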
3. Long-Tail Production Cases
Some failures do not show up until:
- a user phrases the request strangely
- multiple small weaknesses line up in one run
- a tool behaves oddly in a real environment
- the retrieval surface is fresher, messier, or less clean than the test set assumed
This is why online evals matter even for teams with strong offline discipline.
4. Live Trust Decisions
When online evals show worsening real behavior, they inform decisions like:
- tighten boundaries
- reduce autonomy
- increase human review
- pause a fragile tool path
- prioritize a fix before expanding rollout
This is where Reliability Reviews for Agents sits above the eval layers.
The review uses the evidence and decides what operating posture still deserves trust.
What Does Not Count
This article needs one blunt section because teams blur these categories too easily.
Observability Without Judgment Is Not an Online Eval
Metrics, logs, and traces matter.
But if no scoring, grading, labeling, or pass/fail logic is being applied, you have evidence, not an eval.
Static Benchmark Theater Is Not a Serious Offline Layer
If the offline suite is detached from the real workflows, real constraints, and real failure patterns of the product, it is not doing much useful release work.
Anecdotes Are Not a Quality System
A few memorable production failures can be useful signals.
They do not become a durable quality layer until they are turned into structured evaluation cases.
What Teams Commonly Get Wrong
Several mistakes show up repeatedly.
Treating Offline Evals as the Whole Story
This produces false confidence because the test set reflects only what the team remembered to encode.
Treating Online Evals as Just Dashboards
Without explicit grading logic, live monitoring remains incomplete.
Treating Online Evals as a Substitute for Release Gating
If the first serious quality judgment happens only after deployment, the team is shipping blind.
Failing To Move Live Failures Back Into Offline Suites
This breaks the learning loop.
The same class of issue stays expensive because it keeps being rediscovered live.
A Practical Starting Point for Small Teams
Small teams do not need a giant evaluation platform to start using this split correctly.
A workable minimum version is:
Offline
- 10 to 20 important scenarios
- a regression harness for key workflows
- a few trajectory, tool, and control assertions
- one or two envelope thresholds
Online
- sampled trace review on important workflows
- one lightweight live grading layer
- basic drift signals around retries, rescue, or boundary pressure
- one rule for turning repeated live failures into new offline tests
That is enough to create a real feedback loop instead of two disconnected piles of metrics and tests.
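The last item on the online list, the promotion rule, can be as simple as a counter over failure labels: once the same labeled pattern shows up often enough in a review window, it gets promoted into the offline suite. The threshold and label names below are assumptions.

```python
from collections import Counter

# Sketch of the "repeated live failure -> new offline test" rule: once a labeled
# failure pattern appears PROMOTE_AFTER times in the current review window and is
# not already covered offline, it gets promoted. Threshold is illustrative.

PROMOTE_AFTER = 3


def patterns_to_promote(graded_runs: list[dict], already_covered: set[str]) -> list[str]:
    counts = Counter(run["failure_label"] for run in graded_runs if run.get("failure_label"))
    return [
        label for label, count in counts.items()
        if count >= PROMOTE_AFTER and label not in already_covered
    ]

# Example: three "wrong_tool_for_refund" failures this week and no existing
# fixture for it -> the label comes back and a new offline case gets written.
```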
Good Evaluation Systems Use Both Layers on Purpose
The deeper point is simple.
Offline evals and online evals are not rival camps.
They are different quality layers that answer different questions.
Offline evals answer:
- should this ship?
- is this version better or worse than the previous one?
- does it meet the bar under known scenarios?
Online evals answer:
- how is the live system actually behaving?
- what failure patterns are emerging under real traffic?
- where is trust getting weaker in production?
Production agent teams need both.
And the strongest systems connect them with a real loop:
- simulate offline
- expose live
- absorb new failure evidence
- migrate it back into the offline layer
That is how evaluation stops being a one-time gate and becomes a continuous quality discipline.
FAQ
Aren’t online evals just monitoring?
No.
Monitoring collects evidence. Online evals add explicit grading, labeling, or pass/fail logic to sampled live interactions.
Aren’t offline evals just regression tests?
Regression tests are one important offline-eval category, but offline evals are broader. They also include controlled capability checks, safety checks, and comparative pre-release tests.
What should move from online into offline?
Any repeated live failure pattern that matters should migrate back into the offline layer as a regression case, scenario fixture, rubric update, or dataset example.
Can small teams do this without a dedicated eval platform?
Yes.
The key requirement is not tooling. It is keeping the two jobs separate: offline for controlled release judgment, online for live-behavior judgment.
What comes next after this article?
The next strong follow-ons are:
- traces as test data
- using production runs to improve agent quality
- the most common ways agents fail silently
This article closes the core Phase 5 distinction work by separating pre-release evaluation from live-system evaluation.