How to Review an AI Agent Demo Without Getting Fooled

A 30-minute AI agent demo can prove or disprove production readiness if you know what to test live, what to ask the builder, and what to refuse to accept as proof. The D.E.M.O. lens gives you four tells.

A polished agent demo is the worst possible place to evaluate an AI agent.

That is the inconvenient frame nobody wants to start with, but it is the only honest one.

A demo is a curated environment. The inputs are clean. The context window is fresh. The tools are warm. The vendor knows which path through the workflow is going to work, and they walk you down exactly that path. You watch the agent succeed three times in a row and you start to believe.

Then you buy.

Then you wire it into real systems with real users and real edge cases, and the same agent that “just worked” in the demo solves about half as many tasks. Industry-cited numbers put the gap at roughly a 94 percent success rate in demos versus 52 percent in production. Gartner has projected that more than 40 percent of agentic AI projects will be cancelled by the end of 2027, primarily because of reliability issues that the original demos failed to surface.

The good news is that you can do better.

A 30-minute demo can prove or disprove production readiness — but only if you stop watching and start testing.

This is the buyer-skepticism essay the AgentEngineering Opinions hub still owed. It is a companion to AgentOps Is the Missing Layer Between an AI Demo and a Real Product. That article was about what the vendor should be building behind the demo. This article is about what you, the reviewer, should be doing during the demo.

The math nobody puts in the deck

Multi-step agent workflows are governed by exponential decay.

If each step in a chain has a success probability p, the probability of the whole n-step chain succeeding is p^n.

That looks innocent until you put real numbers into it.

| Per-step accuracy | 5 steps | 10 steps | 20 steps |
|---|---|---|---|
| 99% | 95.1% | 90.4% | 81.8% |
| 95% | 77.4% | 59.9% | 35.8% |
| 90% | 59.0% | 34.9% | 12.2% |
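
If you want to check those numbers yourself, the whole table is two lines of Python:

```python
# P(an n-step chain succeeds) = p ** n, with per-step success probability p
for p in (0.99, 0.95, 0.90):
    print(f"{p:.0%} per step:", ", ".join(f"{p**n:.1%} at {n} steps" for n in (5, 10, 20)))
```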

A 95-percent-per-step agent — which is already optimistic for current models on real tasks — is a 36-percent agent across a 20-step process.

A demo is a five-step process.

A production deployment is a 20-step process.

The vendor will not put this table in the deck. Hold it in your head when you watch.

Two kinds of demoware to learn to see

Most misleading agent demos fall into one of two categories.

Happy Path Engineering is when a real system is shown only in the conditions where it cannot fail. The agent exists. The tools exist. The integration is real. But the demo is a rehearsed walk down the one path the team has spent four months hardening. Step off that path and the system breaks the same way most software does.

Theater is something else. Theater is when the system on stage is not the system being sold. The agent might be a thin wrapper around a hardcoded flow. The “tool calls” might go to stubbed endpoints. The reasoning trace might be cosmetic. What you are watching is a video of the product, not the product.

The two categories require different responses. Happy Path Engineering is recoverable — the system is real and you can negotiate scope to its actual envelope. Theater is not. You cannot scope your way out of buying nothing.

The job of a demo reviewer is to tell which one is on stage.

The D.E.M.O. lens

Four tells. One word. Each one is something a real agent does naturally that scripted theater cannot fake under pressure.

| Letter | Tell | Live test |
|---|---|---|
| D | Determinism | Run the same input twice |
| E | Edges | Ask for an input the demo wasn’t tuned on |
| M | Mistakes | Force a failure and watch the recovery |
| O | Off-Script | Change the input mid-run |

If you force at least one of each during the demo session, you will know more about the system than the vendor wants you to know.

D — Determinism: be suspicious of perfect repetition

LLM-driven agents are non-deterministic systems. Run the same prompt through the same model with the same tools, and you should see small natural variation: a slightly different phrasing, a different order of tool calls, a different intermediate reasoning step.

If you watch a “live” agent demo and the same input produces a token-identical output two runs in a row, one of three things is true:

  1. The output is cached.
  2. The model is being called with temperature 0 and a fixed seed.
  3. You are watching a video.

None of those are reasons to refuse to buy. Determinism can be engineered honestly — caching is fine, seeded calls are fine. But the vendor should be able to tell you which one is happening, and what changes when it isn’t.
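
If you have API access yourself, the probe is trivial to script. A minimal sketch, assuming an OpenAI-style client and an illustrative prompt and model name; substitute whatever interface the vendor actually exposes:

```python
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarize this support ticket and draft a refund decision."  # illustrative

def run_once() -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                 # illustrative model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,                     # sampling on, so variation is expected
    )
    return resp.choices[0].message.content

a, b = run_once(), run_once()
print("token-identical: cached, seeded, or scripted?" if a == b else "natural variation")
```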

The live test is one sentence: “Run the same input again.”

If they refuse, that is the answer.

E — Edges: ask for an input the demo wasn’t tuned on

Demos are tuned on a small input set. The cleanest way to find the edge of that set is to step just outside it.

Useful asks:

  1. Rephrase the same request in different words and see whether the answer holds.
  2. Reorder the facts in the input without changing any of them.
  3. Make the same request while claiming a different level of authority (“as the account admin…”).

This idea has a name in the literature now. The Mount Sinai team calls it factorial stress testing — systematically varying aspects of the input that should not change the correct answer (framing, order of facts, stated authority of the user) and watching what changes anyway. If the agent’s behavior shifts dramatically because the user identifies as “an expert,” the system has social-anchoring fragility and is going to behave unpredictably in production.
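
As a harness, that looks roughly like the sketch below, with `ask_agent` as a placeholder for however you invoke the system under review; every factor and prompt here is illustrative:

```python
import itertools

def ask_agent(prompt: str) -> str:
    return "escalate to a human reviewer"   # stub so the sketch runs end to end

# Surface factors that should NOT change the correct answer.
FRAMING   = ["", "URGENT: "]
AUTHORITY = ["", "As a domain expert, I can confirm "]
ORDERINGS = [("the invoice is unpaid", "the customer disputes the charge"),
             ("the customer disputes the charge", "the invoice is unpaid")]

answers = set()
for framing, authority, (first, second) in itertools.product(FRAMING, AUTHORITY, ORDERINGS):
    answers.add(ask_agent(f"{framing}{authority}{first}, and {second}. What should we do?"))

# One behavior across all eight variants is the pass condition; divergence
# here predicts social-anchoring fragility in production.
print(f"{len(answers)} distinct behavior(s) across 8 variants")
```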

The signal you want is graceful degradation. A real agent on a less familiar input should produce a slightly worse answer, ask a clarifying question, or correctly say it can’t help. Theater, faced with anything off-script, either shatters or produces a flat refusal.

M — Mistakes: insist on seeing failure

Production-grade agents have a recovery story. Demoware does not.

If you can’t see the agent fail and recover during the demo, you cannot tell whether the recovery layer exists at all. Force it.

Useful asks:

  1. Revoke or expire a credential mid-run and watch what the agent does with the auth failure.
  2. Point a step at a tool that returns an empty result.
  3. Feed one step data that doesn’t match the schema the agent expects.

The pattern to watch for in a real agent: the failure is classified. The agent recognizes the kind of failure (auth expired, rate limited, tool returned empty, schema mismatch), picks an explicit fallback, and either tries something else or tells the user precisely what went wrong.
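
In code, the real pattern is roughly: name the failure first, then pick the fallback. A sketch using the failure kinds listed above; the handlers are illustrative:

```python
from enum import Enum, auto

class Failure(Enum):
    AUTH_EXPIRED = auto()
    RATE_LIMITED = auto()
    EMPTY_RESULT = auto()
    SCHEMA_MISMATCH = auto()

def classify(status: int, body: dict, expected_keys: set) -> Failure | None:
    # Name the kind of failure before deciding what to do about it.
    if status == 401:
        return Failure.AUTH_EXPIRED
    if status == 429:
        return Failure.RATE_LIMITED
    if status == 200 and not body:
        return Failure.EMPTY_RESULT
    if not expected_keys <= body.keys():
        return Failure.SCHEMA_MISMATCH
    return None  # genuinely fine: proceed

# Each named failure maps to an explicit fallback, not a generic retry.
FALLBACK = {
    Failure.AUTH_EXPIRED: "refresh credentials, retry once",
    Failure.RATE_LIMITED: "back off and requeue",
    Failure.EMPTY_RESULT: "try the secondary data source",
    Failure.SCHEMA_MISMATCH: "halt and report exactly what changed",
}
```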

The pattern to watch for in demoware: the agent hallucinates success. The tool call returned 200 OK, so the agent reports victory. The downstream record never moved. The customer was never notified. The deploy never happened. From the agent’s perspective, the workflow completed.

This is the single most expensive failure mode in production, and it is invisible in a demo unless you ask for the post-action verification — “show me the actual record was created, not just the API response.”
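
Concretely, post-action verification is a second read after the write. A sketch, with every URL and field name illustrative:

```python
import requests

BASE = "https://crm.example.com/api"        # illustrative endpoint

resp = requests.post(f"{BASE}/records", json={"name": "Acme", "status": "refund_issued"})
resp.raise_for_status()                      # a 2xx only proves the call landed

# Trust, then verify: re-read the downstream state in a separate request.
record_id = resp.json()["id"]
check = requests.get(f"{BASE}/records/{record_id}")
check.raise_for_status()
assert check.json().get("status") == "refund_issued", \
    "the API reported success but the record never moved"
```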

If the demo cannot show you the downstream change, the vendor has not built the integration. They have built the API call.

O — Off-Script: change the input mid-run

Real autonomy is the ability to reorient when conditions change. Scripted flows lose state.

Three off-script moves:

  1. Interrupt and amend. Mid-task, add a new constraint. (“Actually, only do this if the customer is in the EU.”) A real agent picks up the constraint and reroutes. A scripted flow ignores it or starts over.
  2. Contradict an earlier turn. (“I was wrong about the date — use the 18th, not the 15th.”) A real agent updates state and continues. Sycophantic theater agrees with the contradiction even when the new value is impossible.
  3. Ask for a clean restart. (“Forget what I said and start over with this.”) Real agents reset cleanly. Theater carries hidden state from the prior turn into the new one and produces incoherent output.

This is the test scripted demos are least able to fake. Off-script behavior is exactly what production users do every day, and the agent that handles it gracefully is the agent that survives.
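
The contradiction move is the easiest to turn into a repeatable probe. A sketch, with `ask_agent` again standing in for the real system:

```python
def ask_agent(history: list) -> str:
    return "Confirmed: the review is now scheduled for the 18th."  # stub reply

history = [
    {"role": "user", "content": "Schedule the review for the 15th."},
    {"role": "assistant", "content": "Done. The review is on the 15th."},
    {"role": "user", "content": "I was wrong about the date. Use the 18th, not the 15th."},
]

reply = ask_agent(history)

# A real agent carries exactly one value forward; theater either keeps
# both dates or agrees with whatever was said last, even when impossible.
assert "18th" in reply and "15th" not in reply, "stale state leaked into the reply"
print("state updated cleanly")
```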

Hidden HITL — the deception named in plain English

The single most overstated capability claim in the agent market is autonomy.

The most common form of the deception is hidden human-in-the-loop. You watch the agent “execute” a workflow, but somewhere off-camera a human approves each step. The vendor will phrase it carefully: “your team gets notified and can step in if needed,” or “there’s a confirmation layer for high-stakes actions,” or “we keep a human in the loop for regulated environments.”

These phrases describe assisted workflow, not autonomous execution.

Both can be useful products. Only one is what the vendor is selling.

The questions that flush this out:

  1. Is a human reviewing or approving any step between the agent’s decision and the action taking effect?
  2. How often does that human actually intervene, and what happens when they don’t?
  3. If every human on your side logged off right now, would this workflow still complete end to end?

If the answer to the third question is no, the system is not autonomous. That is fine. It is also a different product, with a different price, deployed in a different way. Make the vendor name it.

Five questions to ask the builder

After the live tests, shift from observable behavior to direct asks.

  1. “Open the configuration panel, not the chat widget.” Show me where qualification logic, escalation triggers, and approved knowledge sources are configured. A real governance layer has partitioned access, versioned sources, and inspectable rules. “Our team configures that for you” is a black box answer.

  2. “Show me one production trace from the last 24 hours.” The trace should connect the user request, the model calls, the tool invocations, and any policy checks in a single structured record. If they can show you a real trace from a real customer, the observability layer exists. If every trace is a sandbox, it doesn’t. (A sketch of the shape follows this list.)

  3. “What counts as a billable event?” A single user action might trigger five internal API calls. Without clarity, a “reasonable” budget can expand 5x in production. The good answer is a per-step or per-token cost model with a worked example (one is sketched after this list). The bad answer is a flat per-seat number with no underlying unit.

  4. “How do you handle model version drift?” The underlying foundation model will be updated whether you want it to be or not. The good answer is a version-pinning strategy plus a regression test suite that runs on every model change. The bad answer is “we always use the latest.” The terrible answer is “the customer doesn’t notice.”

  5. “What’s the decommissioning plan?” AI systems do not last forever. The good answer describes data export, model handoff, and a contractually defined wind-down process. No answer at all is the bad answer; if no one has thought about leaving, no one has thought about staying responsibly either.
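
For question 2, a trace that actually connects the layers might look something like this; every field name here is illustrative, not any vendor’s real schema:

```python
# One structured trace record: request, model calls, tool calls, and
# policy checks joined under a single trace_id.
trace = {
    "trace_id": "tr_8f3a",
    "user_request": "Refund the order and notify the customer",
    "model_calls": [
        {"step": 1, "model": "gpt-4o-2024-08-06", "latency_ms": 840, "tokens": 2150},
    ],
    "tool_calls": [
        {"step": 2, "tool": "crm.create_refund", "status": 200, "verified_downstream": True},
        {"step": 3, "tool": "email.send", "status": 200, "verified_downstream": True},
    ],
    "policy_checks": [
        {"rule": "refund_under_limit", "passed": True},
    ],
    "outcome": "completed",
}
# If every trace the vendor can show you has this shape but comes from a
# sandbox tenant, the observability layer may exist only in the demo.
```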
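For question 3, the fan-out math is worth doing before you sign. A worked example with illustrative numbers:

```python
calls_per_action  = 5        # internal model/tool calls per user action (illustrative)
tokens_per_call   = 3_000    # illustrative average
usd_per_1k_tokens = 0.01     # illustrative rate
actions_per_month = 10_000

budgeted = actions_per_month * 1 * tokens_per_call / 1_000 * usd_per_1k_tokens
actual   = actions_per_month * calls_per_action * tokens_per_call / 1_000 * usd_per_1k_tokens
print(f"budgeted ${budgeted:,.0f}/mo, actual ${actual:,.0f}/mo")  # $300 vs $1,500: the 5x
```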

What a passing demo looks like

A demo passes the D.E.M.O. lens when:

  1. the vendor can tell you exactly why a repeated input does or does not repeat,
  2. inputs just outside the tuned set degrade gracefully instead of shattering,
  3. a forced failure is classified and recovered in front of you, with the downstream state verified,
  4. mid-run changes are absorbed without losing or corrupting state, and
  5. the builder answers the five questions with artifacts (configs, traces, cost models) rather than adjectives.

You are not going to get a perfect score. You are looking for honest answers, real artifacts, and the absence of theater.

The demo that passes earns a paid pilot — short, scoped, with read-only production access, observable trajectories, and a written acceptable-error-rate threshold. That is where you find out whether the production system actually behaves the way the demo suggested.

The demo that fails the lens is not necessarily a fraud. It is sometimes a real engineering team that has been pushed by the sales motion to show capability they don’t yet have. Tell them which test failed. Tell them what would change your mind. Buyers who do this give honest builders a way to win, and they make the dishonest ones uncomfortable enough to leave the room.

That is the whole point.

You are not trying to break the demo to feel clever.

You are trying to tell the difference between a real agent and a scripted one — in the only window you will get to see them side by side.

If you can do that in 30 minutes, you will spend less money on demoware and more money on the systems that actually work after the curtain comes down.

FAQ

Q: Why do AI agent demos look so much better than what shows up in production? A: Demos run in fresh contexts on curated inputs. Production runs in long sessions with shifting schemas, expired tokens, malformed data, and compound probability decay across multi-step chains. The conditions barely overlap.

Q: Can I really evaluate a real agent in a 30-minute demo? A: Yes — if you stop watching and start testing. The D.E.M.O. lens is designed to fit inside a single demo session. Force at least one of each test (Determinism, Edges, Mistakes, Off-Script) and you will know more than the vendor wants you to know.

Q: What’s the single most useful question to ask during an agent demo? A: “Show me the same input twice.” If the result is identical to the token, the output is cached, seeded, or scripted. None of those are disqualifying; the vendor’s inability to tell you which one is.

Q: What’s the most common deceptive pattern in AI agent demos? A: Hidden human-in-the-loop framed as autonomy. Watch for any “your team gets notified” language. If a human is silently approving the agent’s actions before they take effect, the system is assisted workflow, not an autonomous agent.

Q: How do I tell a real tool integration from a fake one? A: Ask for post-action verification. “Show me the record was actually created, not just that the API returned 200.” Brittle systems treat a 200 OK as success. Reliable systems verify the downstream state changed.

Q: What should happen after a demo passes the D.E.M.O. lens? A: A short paid pilot with read-only production access, observable trajectories, and a written acceptable-error-rate threshold. That is where the demo’s claims meet real conditions.

Q: How does this connect to evaluation, regression testing, and AgentOps? A: Demo review is the buyer-side mirror of the engineering disciplines used inside the team that built the agent. The same patterns the builder uses to keep the system honest after release — trajectory evaluation, regression testing, tracing and observability, and AgentOps — are the same patterns you should expect to see evidence of during the demo.

Q: What if the vendor refuses to let me run these tests? A: That is the test result. Production AI systems are evaluated under adversarial conditions every day. A vendor who cannot tolerate four reasonable stress tests in a sales demo will not survive the first month of real use. Walk.