The median case got faster. The tail case collapsed. Every recent run lands inside a tight latency band — no more unpredictable 12–14-turn spikes.
Contractors place concrete orders from the field. They are not going to wait through a leisurely 12-turn chat to schedule a truck. End-to-end latency matters, but the shape of the latency distribution matters more: runs that average forty-five seconds are tolerable, but runs whose tail stretches to seventy-five seconds are a churn risk.
Pre-engagement traces showed exactly the latter. Most happy paths landed inside ten turns and roughly forty seconds of wall-clock. But a meaningful minority spiked to twelve or fourteen turns because the model fabricated an order ID the UI could not click, or skipped a required button render, or asked a question the flow had already answered. Those were ten-to-fifteen extra seconds of unpredictable drag, and contractors remembered them.
The same structural work that cut token costs also collapsed the latency variance. The unifying principle: remove work the LLM shouldn’t have been doing, and the failure modes that came with it disappear.
The headline is not a single number; it is the collapse of variance. Every recent run lands in roughly the same latency band, and the tail traces that used to spike into the seventy-second range stopped happening.
Wall-clock timings came from the orchestration layer, recorded per turn and summed per trace. We ran head-to-head happy-path traces against the same staging backend on the same day to control for network variance. More importantly, we watched the tail: over a two-week window of production traces after the change-set landed, we looked for fabricated-ID incidents and for runs that exceeded twelve turns. The class of tail behaviour we built the engagement around did not return.
This is the part worth highlighting for any operator reading this: we did not benchmark one happy path and declare victory. We watched the class of failures we cared about across a representative window, and the reduction held.
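For operators who want to run the same kind of check, here is a minimal Python sketch of the measurement described above: per-turn wall-clock summed per trace, plus flags for runs over twelve turns and for IDs the backend does not recognise. The `Turn` and `Trace` shapes, the `referenced_ids` field, and the `known_ids` set are assumptions made for illustration, not the client's actual trace schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

MAX_TURNS = 12  # from the text: runs exceeding twelve turns count as tail runs


@dataclass
class Turn:
    latency_s: float           # wall-clock for this turn, from the orchestration layer
    referenced_ids: List[str]  # order IDs the model mentioned (hypothetical field)


@dataclass
class Trace:
    trace_id: str
    turns: List[Turn]


def summarize(trace: Trace, known_ids: Set[str]) -> Dict[str, object]:
    """Sum per-turn latency for one trace and flag the two tail signals."""
    fabricated = sorted(
        {i for turn in trace.turns for i in turn.referenced_ids if i not in known_ids}
    )
    return {
        "trace_id": trace.trace_id,
        "turns": len(trace.turns),
        "wall_clock_s": sum(turn.latency_s for turn in trace.turns),
        "over_turn_limit": len(trace.turns) > MAX_TURNS,
        "fabricated_ids": fabricated,
    }


def tail_traces(traces: List[Trace], known_ids: Set[str]) -> List[Dict[str, object]]:
    """Return summaries only for traces that show a tail signal."""
    summaries = [summarize(trace, known_ids) for trace in traces]
    return [s for s in summaries if s["over_turn_limit"] or s["fabricated_ids"]]
```

Run over a representative window of production traces, an empty result from `tail_traces` is the outcome the paragraph above describes: no fabricated IDs and no runs past twelve turns.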
The biggest latency wins are almost never "switch models". They are usually "give the LLM less to be responsible for". If your copilot’s worst traces are the ones that scare you — fabricated IDs, dropped buttons, repeated questions, the model confidently restating something it got wrong two turns ago — the fix is structural, not prompt-engineering, and almost always leaves the model choice intact.
If you want a read on your own copilot, send us one representative trace and your system prompt. We will tell you in an hour what the dominant failure class is and whether the same pattern applies.
No. Same model, same tier, same vendor. Every number in this case came from the original model choice.
Sustained. We watched the class of tail failures across two weeks of post-engagement production traces. The behaviour did not return. The engagement also installed the observability that will tell the client if it ever does.
Yes, when the product is a chat agent walking a user through a structured operational workflow — tickets, rosters, routes, form-fill on structured records. The farther the product is from that shape (e.g. purely open-ended Q&A), the less the pattern applies.
Because operators remember the worst trace of their week, not the average one. A copilot whose average is forty seconds but whose tail is seventy-five seconds generates more support tickets, more "is this thing broken?" messages, and more churn than one whose average is forty-five but whose tail is forty-seven. Collapsing variance improves the user's perception of reliability in a way that a faster average doesn't.
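The arithmetic behind that comparison can be made concrete with a small Python sketch. The two sample lists below are synthetic, chosen only so that their mean and worst case match the figures quoted above; they are not measurements from the engagement.

```python
import statistics
from typing import Dict, List

# Synthetic run times in seconds, for illustration only.
SPIKY = [36.0] * 8 + [37.0, 75.0]   # faster on average, one bad trace
STEADY = [43.0, 47.0] + [45.0] * 8  # slower on average, tight band


def latency_profile(samples: List[float]) -> Dict[str, float]:
    """Report the mean alongside the worst case and the overall spread."""
    ordered = sorted(samples)
    return {
        "mean_s": statistics.mean(ordered),
        "worst_s": ordered[-1],
        "spread_s": ordered[-1] - ordered[0],
    }


spiky_profile = latency_profile(SPIKY)
steady_profile = latency_profile(STEADY)
```

Here `spiky_profile` has a mean of 40 seconds with a worst case of 75, while `steady_profile` has a mean of 45 with a worst case of 47. Ranking by mean alone would pick the spiky copilot; ranking by worst case or spread picks the one operators experience as reliable.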