
Online Evals vs Offline Evals

Offline evals decide whether a change deserves release. Online evals judge how the live system is actually behaving under real traffic. Production agent teams need both, and they need them for different reasons.

They are two different evaluation layers, and they are not interchangeable.

Offline evals ask:

Before or around release, does this version meet the bar we intend to hold?

Online evals ask:

In live use, how is the system actually behaving under real traffic?

That distinction matters because many teams still collapse these categories into one vague bucket called "evals."

When that happens, they usually make one of two mistakes:

- they gate releases on offline suites alone and assume production will match, or
- they watch live dashboards alone and call that evaluation.

Neither is enough.

This article follows Regression Testing for Agents, Reliability Reviews for Agents, Tracing and Observability for Agent Systems, and Drift, Degradation, and Slow Failure in Long-Lived Agent Systems. Those articles explain how to protect releases, review trust, collect evidence, and catch slow decay. This one explains where to judge quality before release and where to judge it after the agent is already in the world.

What Offline Evals Are For

Offline evals are the controlled evaluation layer.

They happen before release, around release, or against selected historical data in a non-live setting.

They are how a team asks:

Offline evals usually work with:

- curated scenario sets
- reference answers or grading rubrics
- fixed comparison conditions
- repeatable, versioned runs

The point is not that they are fake.

The point is that they are deliberate.

You control the scenarios, the grading criteria, and the comparison conditions closely enough to make a release decision.

That is why offline evals are the natural home for:

- release-gate and regression checks
- controlled capability checks
- safety and control checks
- repeatable side-by-side comparisons

If the question is:

Should this change ship?

the answer should usually begin in the offline layer.
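The shape of that offline decision can be sketched in a few lines. Everything below is a hypothetical placeholder, not a real harness: the scenario format, the toy phrase-matching grader, and the `run_agent` callable stand in for whatever the team actually uses.

```python
# Minimal offline release-gate sketch. Scenarios, grading criteria,
# and the release bar are all controlled by the team.

def grade(output: str, scenario: dict) -> bool:
    """Toy grader: pass if every required phrase appears in the output."""
    return all(phrase in output for phrase in scenario["must_contain"])

def offline_gate(run_agent, scenarios: list[dict], bar: float = 0.95) -> bool:
    """Run every curated scenario; True only if the pass rate meets
    the release bar. A failing gate blocks the release."""
    passed = sum(grade(run_agent(s["input"]), s) for s in scenarios)
    return passed / len(scenarios) >= bar

# Example with a stub agent that returns a canned answer.
scenarios = [
    {"input": "refund policy?", "must_contain": ["30 days"]},
    {"input": "cancel order", "must_contain": ["order id"]},
]

def stub(question: str) -> str:
    return "Refunds allowed within 30 days; include your order id."

print(offline_gate(stub, scenarios, bar=1.0))  # prints True
```

The important property is not the toy grader; it is that the scenarios, criteria, and bar are fixed before the run, so two versions can be compared fairly.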

What Online Evals Are For

Online evals are the live-behavior judgment layer.

They use production traffic or recently live traces to score how the system is actually behaving once real users, real edge cases, and real operating pressure are involved.

That does not mean every dashboard is an online eval.

This distinction needs to stay sharp.

Raw observability tells you:

- what requests arrived
- which tools and steps ran
- what errors and latencies occurred
- what the resulting traces contain

An online eval adds explicit judgment logic on top of that evidence.

It samples or selects live interactions and asks questions like:

- Did this interaction meet the quality bar?
- Did the agent stay inside policy?
- Was the trajectory reasonable, not just the final output?

So online evals are not just live metrics.

They are live metrics plus a scoring or classification step.
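That "metrics plus judgment" point can be made concrete with a small sketch. The trace fields, the sampling rate, and the grading rules below are illustrative assumptions, not a real API:

```python
import random

def sample_traces(traces: list[dict], rate: float = 0.1, seed: int = 0) -> list[dict]:
    """Sample a fraction of live traces for grading."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]

def grade_trace(trace: dict) -> str:
    """The explicit judgment step: label each sampled trace."""
    if trace["policy_violation"]:
        return "fail"
    if trace["tool_calls"] > 10:  # hypothetical tool-call budget
        return "warn"
    return "pass"

def online_eval(traces: list[dict], rate: float = 0.5) -> dict:
    """Live metrics plus a classification step, not metrics alone."""
    labels = [grade_trace(t) for t in sample_traces(traces, rate)]
    return {lbl: labels.count(lbl) for lbl in ("pass", "warn", "fail")}
```

Without `grade_trace`, this would just be monitoring; the labeling step is what makes it an eval.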

If the question is:

How is the live system actually behaving now?

the answer has to come from the online layer.

Why One Layer Cannot Replace the Other

Offline evals are necessary because production is a bad place to discover obvious release mistakes for the first time.

If a model swap broke tool selection, if a prompt rewrite weakened policy handling, or if a new workflow path created obvious failure, the team should catch that before wide exposure.

That is offline eval work.

But offline evals are still incomplete.

They are built from what the team already knows to test.

Real traffic introduces:

- inputs no one thought to write down
- unusual phrasing and user behavior
- rare edge cases arriving at volume
- operating conditions that shift over time

That is why online evals exist.

They surface the gap between:

- what the team tested for, and
- what the system actually faces.

The reverse is also true.

Online evals cannot replace offline evals because live traffic is the wrong place to do your basic release gating.

If every meaningful quality judgment only happens after deployment, the team is using production as the test harness.

That is not maturity.

That is late discovery.

The S.E.A.M. Loop

A useful way to connect the two layers is the S.E.A.M. loop:

- Simulate known cases offline
- Expose the system to real traffic
- Absorb the new failures it reveals
- Migrate them back into the offline layer

This is the operating loop good teams actually need.

Simulate

Use offline evals to simulate the cases you already know matter.

That includes:

- regression suites for behavior that already works
- capability checks for new functionality
- safety and boundary checks
- comparative runs between versions

This is where you decide whether a change deserves release.

Expose

Once the system is live, expose it to real traffic with the right guardrails, sampling, and visibility.

This is where online evals begin to matter.

You are no longer asking only what the system does on curated scenarios.

You are asking what it does when the world is less polite.

Absorb

When online evals reveal new failure patterns, do not leave them as anecdotes.

Absorb them into the operating memory of the team.

That means:

- labeling and classifying the failures, not just noticing them
- agreeing on what correct behavior should have been
- recording the pattern somewhere more durable than a dashboard

Migrate

Then move the important live failures back into offline evaluation.

This is the part many teams skip.

If a live issue stays only in a dashboard, the system has not really learned.

The new pattern should migrate into:

- regression cases
- scenario fixtures
- rubric updates
- dataset examples

That is how the offline layer becomes less naive over time.
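One way to picture the Migrate step is converting a graded live failure into a replayable offline case. The field names here (`trace_id`, `user_input`, and so on) are hypothetical stand-ins for whatever the team's trace store actually records:

```python
import json
from datetime import date

def migrate_to_fixture(live_trace: dict, expected_behavior: str) -> dict:
    """Turn a failing live trace into an offline regression fixture."""
    return {
        "id": f"live-{live_trace['trace_id']}",
        "input": live_trace["user_input"],
        "expected": expected_behavior,  # decided during the Absorb step
        "source": "online-eval",
        "added": date.today().isoformat(),
    }

failure = {"trace_id": "a1b2", "user_input": "cancel but keep the coupon"}
fixture = migrate_to_fixture(
    failure, "Explain that the coupon is voided on cancellation."
)
# Append the fixture to the offline suite so the pattern is checked
# before every future release, not rediscovered live.
print(json.dumps(fixture, indent=2))
```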

What Belongs in Offline Evals

Offline evals should own the things you need to compare in a controlled way before or around release.

That usually includes:

1. Release-Gate Checks

This is where Regression Testing for Agents fits directly.

Use offline evals to ask:

- Does the new version still pass everything the old version passed?
- Did this change break behavior that already worked?

2. Controlled Capability Checks

If the team is introducing a new capability, offline evals help answer whether the system can do it at all under known scenarios.

This is where reference-backed task sets and comparative runs matter most.

3. Safety and Control Checks

Offline evals should test the boundaries the team already knows it must protect:

- policy and content limits
- restricted or high-risk tool actions
- permission and escalation boundaries

4. Repeatable Comparative Checks

Any question that benefits from stable side-by-side comparison belongs here.

That includes:

- model swaps
- prompt rewrites
- workflow and tool changes

Offline evals are good when the team needs a fair comparison.

What Belongs in Online Evals

Online evals should own the things that only become legible once the system is live or close to live.

That usually includes:

1. Live Trace Quality

This is where Evaluating Agent Trajectories, Not Just Outputs and Tracing and Observability for Agent Systems connect.

Use online evals to judge sampled live traces for:

- sensible tool selection and sequencing
- policy adherence under real inputs
- coherent trajectories, not just acceptable final outputs

2. Drift and Slow-Failure Detection

This is where the previous article matters directly.

Use online evals to see whether the live system is showing:

- drift from expected behavior
- gradual quality degradation
- slow failure patterns that no single incident makes obvious

Offline suites can hint at these problems.

Online evals tell you whether they are already happening.
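A minimal sketch of what that "already happening" detection can look like, assuming graded pass/fail outcomes are streaming in from the online layer; the window size and tolerance below are illustrative, not recommendations:

```python
from collections import deque

class DriftMonitor:
    """Compare a rolling window of online-eval outcomes against a
    baseline pass rate and flag gradual decline."""

    def __init__(self, baseline_pass_rate: float, window: int = 100,
                 tolerance: float = 0.05):
        self.baseline = baseline_pass_rate
        self.scores = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, passed: bool) -> None:
        self.scores.append(passed)

    def drifting(self) -> bool:
        """True once the rolling pass rate sits below baseline - tolerance."""
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough evidence yet
        rate = sum(self.scores) / len(self.scores)
        return rate < self.baseline - self.tolerance

monitor = DriftMonitor(baseline_pass_rate=0.90, window=10)
for outcome in [True] * 7 + [False] * 3:  # 70% pass rate over the window
    monitor.record(outcome)
print(monitor.drifting())  # prints True: 0.70 < 0.90 - 0.05
```

No single failing trace here is alarming on its own; the signal only exists because graded outcomes are aggregated over time.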

3. Long-Tail Production Cases

Some failures do not show up until:

- real users phrase things no test author anticipated
- rare inputs arrive at production volume
- long-running sessions accumulate unusual state

This is why online evals matter even for teams with strong offline discipline.

4. Live Trust Decisions

When online evals show worsening real behavior, they inform decisions like:

- tightening guardrails or narrowing autonomy
- rolling back or pausing a release
- escalating the system into a reliability review

This is where Reliability Reviews for Agents sits above the eval layers.

The review uses the evidence and decides what operating posture still deserves trust.

What Does Not Count

This article needs one blunt section because teams blur these categories too easily.

Observability Without Judgment Is Not an Online Eval

Metrics, logs, and traces matter.

But if no scoring, grading, labeling, or pass/fail logic is being applied, you have evidence, not an eval.

Static Benchmark Theater Is Not a Serious Offline Layer

If the offline suite is detached from the real workflows, real constraints, and real failure patterns of the product, it is not doing much useful release work.

Anecdotes Are Not a Quality System

A few memorable production failures can be useful signals.

They do not become a durable quality layer until they are turned into structured evaluation cases.

What Teams Commonly Get Wrong

Several mistakes show up repeatedly.

Treating Offline Evals as the Whole Story

This produces false confidence because the test set reflects only what the team remembered to encode.

Treating Online Evals as Just Dashboards

Without explicit grading logic, live monitoring remains incomplete.

Treating Online Evals as a Substitute for Release Gating

If the first serious quality judgment happens only after deployment, the team is shipping too blind.

Failing To Move Live Failures Back Into Offline Suites

This breaks the learning loop.

The same class of issue stays expensive because it keeps being rediscovered live.

A Practical Starting Point for Small Teams

Small teams do not need a giant evaluation platform to start using this split correctly.

A workable minimum version is:

Offline

- a small curated scenario suite, run before every release
- clear pass/fail criteria and a simple release bar

Online

- a sampled slice of live traces, graded on the same rubric at a regular cadence
- a place to record new failure patterns so they can migrate back offline

That is enough to create a real feedback loop instead of two disconnected piles of metrics and tests.

Good Evaluation Systems Use Both Layers on Purpose

The deeper point is simple.

Offline evals and online evals are not rival camps.

They are different quality layers that answer different questions.

Offline evals answer:

Online evals answer:

Production agent teams need both.

And the strongest systems connect them with a real loop: simulate, expose, absorb, migrate.

That is how evaluation stops being a one-time gate and becomes a continuous quality discipline.

FAQ

Aren’t online evals just monitoring?

No.

Monitoring collects evidence. Online evals add explicit grading, labeling, or pass/fail logic to sampled live interactions.

Aren’t offline evals just regression tests?

Regression tests are one important offline-eval category, but offline evals are broader. They also include controlled capability checks, safety checks, and comparative pre-release tests.

What should move from online into offline?

Any repeated live failure pattern that matters should migrate back into the offline layer as a regression case, scenario fixture, rubric update, or dataset example.

Can small teams do this without a dedicated eval platform?

Yes.

The key requirement is not tooling first. It is keeping the two jobs separate: offline for controlled release judgment, online for live-behavior judgment.

What comes next after this article?

The next strong follow-ons are:

This article closes the core Phase 5 distinction work by separating pre-release evaluation from live-system evaluation.