Production agent teams need two different evaluation layers.
They are not interchangeable.
Offline evals ask:
Before or around release, does this version meet the bar we intend to hold?
Online evals ask:
In live use, how is the system actually behaving under real traffic?
That distinction matters because many teams still collapse these categories into one vague bucket called evals.
When that happens, they usually make one of two mistakes:
- they over-trust offline test sets and ship with false confidence
- or they overreact in the other direction and try to treat raw production telemetry as the whole evaluation strategy
Neither is enough.
This article follows Regression Testing for Agents, Reliability Reviews for Agents, Tracing and Observability for Agent Systems, and Drift, Degradation, and Slow Failure in Long-Lived Agent Systems. Those articles explain how to protect releases, review trust, collect evidence, and catch slow decay. This one explains where to judge quality before release and where to judge it after the agent is already in the world.
What Offline Evals Are For
Offline evals are the controlled evaluation layer.
They happen before release, around release, or against selected historical data in a non-live setting.
They are how a team asks:
- does this candidate still solve the tasks it should solve?
- did this version regress?
- does it stay inside the safety and control boundary?
- are trajectories, tool calls, and outputs still acceptable under known scenarios?
Offline evals usually work with:
- curated datasets
- replayed traces
- reference-backed scenarios
- regression fixtures
- controlled prompt or workflow comparisons
The point is not that they are fake.
The point is that they are deliberate.
You control the scenarios, the grading criteria, and the comparison conditions closely enough to make a release decision.
That is why offline evals are the natural home for:
- release gating
- regression testing
- controlled capability comparisons
- tolerance thresholds
- repeatable safety checks
If the question is:
Should this change ship?
the answer should usually begin in the offline layer.
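To make that concrete, here is a minimal sketch of what an offline eval gate can look like in code. Everything in it is an assumption: the scenario format, the threshold, and the two callables (`run_agent`, `grade`) stand in for whatever harness, agent entry point, and grader a team actually uses.

```python
from typing import Callable

# Minimal offline eval gate, assuming you can supply two callables:
#   run_agent(prompt)   -> the agent's final output for a curated scenario
#   grade(output, ref)  -> 1.0 (pass) or 0.0 (fail)
# Scenario IDs, fields, and the threshold are illustrative, not prescriptive.

SCENARIOS = [
    {"id": "refund-basic", "input": "Refund order 1042", "reference": "refund_issued"},
    {"id": "policy-edge", "input": "Refund an order from 13 months ago", "reference": "refund_refused_policy"},
    # ...the rest of the curated set
]

PASS_THRESHOLD = 0.95  # the bar this candidate must meet to ship


def offline_gate(run_agent: Callable[[str], str],
                 grade: Callable[[str, str], float]) -> bool:
    scores = []
    for scenario in SCENARIOS:
        output = run_agent(scenario["input"])
        scores.append(grade(output, scenario["reference"]))
    pass_rate = sum(scores) / len(scores)
    print(f"pass rate {pass_rate:.2%} over {len(scores)} scenarios")
    return pass_rate >= PASS_THRESHOLD  # the ship / don't-ship decision
```

The controlled part is the point: the same scenarios, the same grader, the same threshold, every time a candidate is evaluated.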
What Online Evals Are For
Online evals are the live-behavior judgment layer.
They use production traffic or recently live traces to score how the system is actually behaving once real users, real edge cases, and real operating pressure are involved.
That does not mean every dashboard is an online eval.
This distinction needs to stay sharp.
Raw observability tells you:
- what happened
- how often it happened
- where the latency or retry spike appeared
- which traces deserve inspection
An online eval adds explicit judgment logic on top of that evidence.
It samples or selects live interactions and asks questions like:
- was the final answer actually helpful?
- did the agent choose the right tool?
- did it stay grounded?
- did it recover well?
- did it create unnecessary approval pressure?
- did the run stay inside the intended quality boundary?
So online evals are not just live metrics.
They are live metrics plus a scoring or classification step.
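Here is one way that extra judgment step might look, as a rough sketch. The trace shape, the sampling rate, and the `judge` callable are all assumptions; the only point is that an explicit grading step sits on top of the telemetry you already collect.

```python
import random
from typing import Callable, Iterable

# Sketch of an online eval: sample a fraction of live traces and apply an
# explicit judgment step on top of the telemetry you already collect.
# The trace fields and the `judge` callable are illustrative assumptions.

SAMPLE_RATE = 0.05  # grade roughly 5% of live traffic


def online_eval(live_traces: Iterable[dict],
                judge: Callable[[dict], dict]) -> list[dict]:
    graded = []
    for trace in live_traces:
        if random.random() > SAMPLE_RATE:
            continue  # observability still sees this trace; the eval does not
        verdict = judge(trace)  # e.g. {"helpful": True, "grounded": False, "label": "ungrounded_answer"}
        graded.append({"trace_id": trace["id"], **verdict})
    return graded  # these judgments feed dashboards, reviews, and the migrate step
```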
If the question is:
How is the live system actually behaving now?
the answer has to come from the online layer.
Why One Layer Cannot Replace the Other
Offline evals are necessary because production is a bad place to discover obvious release mistakes for the first time.
If a model swap broke tool selection, if a prompt rewrite weakened policy handling, or if a new workflow path created obvious failure, the team should catch that before wide exposure.
That is offline eval work.
But offline evals are still incomplete.
They are built from what the team already knows to test.
Real traffic introduces:
- new user phrasing
- new tool and data conditions
- new long-tail cases
- new operator pressures
- new failure combinations
That is why online evals exist.
They surface the gap between:
- what the team simulated
- and what the live system actually encounters
The reverse is also true.
Online evals cannot replace offline evals because live traffic is the wrong place to do your basic release gating.
If every meaningful quality judgment only happens after deployment, the team is using production as the test harness.
That is not maturity.
That is late discovery.
The S.E.A.M. Loop
A useful way to connect the two layers is the S.E.A.M. loop:
Simulate → Expose → Absorb → Migrate
This is the operating loop good teams actually need.
Simulate
Use offline evals to simulate the cases you already know matter.
That includes:
- common workflows
- brittle tasks
- prior failures
- safety-sensitive scenarios
- trajectory and tool-use checks
This is where you decide whether a change deserves release.
Expose
Once the system is live, expose it to real traffic with the right guardrails, sampling, and visibility.
This is where online evals begin to matter.
You are no longer asking only what the system does on curated scenarios.
You are asking what it does when the world is less polite.
Absorb
When online evals reveal new failure patterns, do not leave them as anecdotes.
Absorb them into the operating memory of the team.
That means:
- label the failure pattern
- review the affected traces
- decide whether trust should narrow
- understand whether the issue is drift, regression escape, or long-tail mismatch
Migrate
Then move the important live failures back into offline evaluation.
This is the part many teams skip.
If a live issue stays only in a dashboard, the system has not really learned.
The new pattern should migrate into:
- regression fixtures
- scenario datasets
- grader rubrics
- reference examples
- release-gate checks
That is how the offline layer becomes less naive over time.
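One lightweight way to make the migration mechanical is to turn a graded live trace directly into a fixture file that the offline suite picks up. The directory layout and field names below are assumptions, not a standard; the expected behavior should come from the review, never be guessed.

```python
import json
from pathlib import Path

# Sketch: promote a flagged live trace into an offline regression fixture.
# Directory layout and field names are assumptions about your own suite.

FIXTURE_DIR = Path("evals/fixtures")


def migrate_to_fixture(trace: dict, failure_label: str) -> Path:
    fixture = {
        "id": f"prod-{trace['id']}",
        "source": "production",
        "failure_label": failure_label,          # e.g. "wrong_tool_for_refund"
        "input": trace["user_input"],            # what the user actually asked
        "expected_behavior": trace["expected"],  # filled in during review, not guessed
    }
    FIXTURE_DIR.mkdir(parents=True, exist_ok=True)
    path = FIXTURE_DIR / f"{fixture['id']}.json"
    path.write_text(json.dumps(fixture, indent=2))
    return path  # the offline suite loads everything under evals/fixtures/
```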
What Belongs in Offline Evals
Offline evals should own the things you need to compare in a controlled way before or around release.
That usually includes:
1. Release-Gate Checks
This is where Regression Testing for Agents fits directly.
Use offline evals to ask:
- did output quality regress?
- did trajectory quality regress?
- did tool or retrieval behavior regress?
- did cost, latency, or retry behavior move outside the release envelope?
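A sketch of what "outside the release envelope" can mean in practice: compare the candidate's offline metrics against the current baseline and fail the gate if any delta exceeds its tolerance. The metric names and tolerance values are placeholders for whatever envelope a team actually holds.

```python
# Sketch of an envelope check for the release gate: compare candidate metrics
# against the current baseline and flag any delta beyond its tolerance.
# Metric names and tolerances are placeholders, not recommendations.

TOLERANCES = {
    "task_pass_rate": -0.02,  # may drop by at most 2 points
    "avg_tool_calls": +1.0,   # may grow by at most 1 call per task
    "avg_latency_s": +0.5,    # may grow by at most half a second
    "avg_cost_usd": +0.01,    # may grow by at most one cent per task
}


def envelope_violations(baseline: dict, candidate: dict) -> list[str]:
    violations = []
    for metric, tolerance in TOLERANCES.items():
        delta = candidate[metric] - baseline[metric]
        # Negative tolerances guard against drops; positive ones guard against growth.
        if (tolerance < 0 and delta < tolerance) or (tolerance > 0 and delta > tolerance):
            violations.append(f"{metric}: {delta:+.3f} (allowed {tolerance:+.3f})")
    return violations  # an empty list means the candidate stays inside the envelope
```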
2. Controlled Capability Checks
If the team is introducing a new capability, offline evals help answer whether the system can do it at all under known scenarios.
This is where reference-backed task sets and comparative runs matter most.
3. Safety and Control Checks
Offline evals should test the boundaries the team already knows it must protect:
- blocked actions
- approval order
- schema integrity
- execution boundaries
- escalation behavior
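These boundaries are usually easiest to keep honest as plain assertions in the offline suite. The sketch below uses pytest-style tests and assumes a hypothetical `run_agent` fixture that returns the ordered actions the agent attempted; the scenarios and action names are made up.

```python
# Sketch of repeatable safety checks as offline assertions (pytest-style).
# `run_agent` and the action names are hypothetical stand-ins.

def test_blocked_action_is_refused(run_agent):
    trace = run_agent("Delete every customer record older than a year")
    attempted = [step["action"] for step in trace["steps"]]
    assert "bulk_delete_records" not in attempted, "agent attempted a blocked action"


def test_approval_precedes_gated_tool(run_agent):
    trace = run_agent("Issue a $5,000 refund to order 1042")
    actions = [step["action"] for step in trace["steps"]]
    # The approval request must appear, and it must come before the gated tool call.
    assert "request_human_approval" in actions
    if "issue_refund" in actions:
        assert actions.index("request_human_approval") < actions.index("issue_refund")
```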
4. Repeatable Comparative Checks
Any question that benefits from stable side-by-side comparison belongs here.
That includes:
- prompt revisions
- model swaps
- tool-routing changes
- retrieval changes
- workflow rewrites
Offline evals are the right tool whenever the team needs a fair, repeatable comparison.
What Belongs in Online Evals
Online evals should own the things that only become legible once the system is live or close to live.
That usually includes:
1. Live Trace Quality
This is where Evaluating Agent Trajectories, Not Just Outputs and Tracing and Observability for Agent Systems connect.
Use online evals to judge sampled live traces for:
- trajectory quality
- tool correctness
- grounding quality
- recovery behavior
- task completion quality
2. Drift and Slow-Failure Detection
This is where Drift, Degradation, and Slow Failure in Long-Lived Agent Systems connects directly.
Use online evals to see whether the live system is showing:
- rising rescue load
- growing inconsistency
- more friction
- more threshold pressure
- weaker recoveries under real workloads
Offline suites can hint at these problems.
Online evals tell you whether they are already happening.
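A minimal version of that detection is a rolling-window rate over the graded online results, with an alert when the rate creeps past a chosen level. The window size and threshold below are illustrative only.

```python
from collections import deque

# Sketch of a slow-failure signal: track a rate (retries, rescues, escalations)
# over a rolling window of recent runs and flag when it crosses an alert level.
# Window size and threshold are illustrative, not recommendations.

class RollingRate:
    def __init__(self, window: int = 500, alert_level: float = 0.15):
        self.events = deque(maxlen=window)
        self.alert_level = alert_level

    def record(self, happened: bool) -> None:
        self.events.append(1 if happened else 0)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifting(self) -> bool:
        # Only trust the signal once the window has filled up.
        return len(self.events) == self.events.maxlen and self.rate() > self.alert_level


rescue_rate = RollingRate()
# For each graded live run: rescue_rate.record(run_needed_human_rescue)
# Open a review or alert when rescue_rate.drifting() becomes True.
```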
3. Long-Tail Production Cases
Some failures do not show up until:
- a user phrases the request strangely
- multiple small weaknesses line up in one run
- a tool behaves oddly in a real environment
- the retrieval surface is fresher, messier, or less clean than the test set assumed
This is why online evals matter even for teams with strong offline discipline.
4. Live Trust Decisions
When online evals show worsening real behavior, they inform decisions like:
- tighten boundaries
- reduce autonomy
- increase human review
- pause a fragile tool path
- prioritize a fix before expanding rollout
This is where Reliability Reviews for Agents sits above the eval layers.
The review uses the evidence and decides what operating posture still deserves trust.
What Does Not Count
This article needs one blunt section because teams blur these categories too easily.
Observability Without Judgment Is Not an Online Eval
Metrics, logs, and traces matter.
But if no scoring, grading, labeling, or pass/fail logic is being applied, you have evidence, not an eval.
Static Benchmark Theater Is Not a Serious Offline Layer
If the offline suite is detached from the real workflows, real constraints, and real failure patterns of the product, it is not doing much useful release work.
Anecdotes Are Not a Quality System
A few memorable production failures can be useful signals.
They do not become a durable quality layer until they are turned into structured evaluation cases.
What Teams Commonly Get Wrong
Several mistakes show up repeatedly.
Treating Offline Evals as the Whole Story
This produces false confidence because the test set reflects only what the team remembered to encode.
Treating Online Evals as Just Dashboards
Without explicit grading logic, live monitoring remains incomplete.
Treating Online Evals as a Substitute for Release Gating
If the first serious quality judgment happens only after deployment, the team is shipping blind.
Failing To Move Live Failures Back Into Offline Suites
This breaks the learning loop.
The same class of issue stays expensive because it keeps being rediscovered live.
A Practical Starting Point for Small Teams
Small teams do not need a giant evaluation platform to start using this split correctly.
A workable minimum version is:
Offline
- 10 to 20 important scenarios
- a regression harness for key workflows
- a few trajectory, tool, and control assertions
- one or two envelope thresholds
Online
- sampled trace review on important workflows
- one lightweight live grading layer
- basic drift signals around retries, rescue, or boundary pressure
- one rule for turning repeated live failures into new offline tests
That is enough to create a real feedback loop instead of two disconnected piles of metrics and tests.
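The last item on the online list, the promotion rule, can be as simple as a counter over failure labels: once the same labeled pattern shows up often enough in a review window, it gets promoted into the offline suite. The threshold and label names below are assumptions.

```python
from collections import Counter

# Sketch of the "repeated live failure -> new offline test" rule: once a labeled
# failure pattern appears PROMOTE_AFTER times in the current review window and is
# not already covered offline, it gets promoted. Threshold is illustrative.

PROMOTE_AFTER = 3


def patterns_to_promote(graded_runs: list[dict], already_covered: set[str]) -> list[str]:
    counts = Counter(run["failure_label"] for run in graded_runs if run.get("failure_label"))
    return [
        label for label, count in counts.items()
        if count >= PROMOTE_AFTER and label not in already_covered
    ]

# Example: three "wrong_tool_for_refund" failures this week and no existing
# fixture for it -> the label comes back and a new offline case gets written.
```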
Good Evaluation Systems Use Both Layers on Purpose
The deeper point is simple.
Offline evals and online evals are not rival camps.
They are different quality layers that answer different questions.
Offline evals answer:
- should this ship?
- is this version better or worse than the previous one?
- does it meet the bar under known scenarios?
Online evals answer:
- how is the live system actually behaving?
- what failure patterns are emerging under real traffic?
- where is trust getting weaker in production?
Production agent teams need both.
And the strongest systems connect them with a real loop:
- simulate offline
- expose live
- absorb new failure evidence
- migrate it back into the offline layer
That is how evaluation stops being a one-time gate and becomes a continuous quality discipline.
FAQ
Aren’t online evals just monitoring?
No.
Monitoring collects evidence. Online evals add explicit grading, labeling, or pass/fail logic to sampled live interactions.
Aren’t offline evals just regression tests?
Regression tests are one important offline-eval category, but offline evals are broader. They also include controlled capability checks, safety checks, and comparative pre-release tests.
What should move from online into offline?
Any repeated live failure pattern that matters should migrate back into the offline layer as a regression case, scenario fixture, rubric update, or dataset example.
Can small teams do this without a dedicated eval platform?
Yes.
The key requirement is not tooling. It is keeping the two jobs separate: offline for controlled release judgment, online for live-behavior judgment.
What comes next after this article?
The next strong follow-ons are:
- traces as test data
- using production runs to improve agent quality
- the most common ways agents fail silently
This article closes the core Phase 5 distinction work by separating pre-release evaluation from live-system evaluation.