We cut a dispatch copilot’s token bill by 59% — same model, same product

Token consumption per happy path went from 75K to under 26K. The bill collapsed. The model stayed.

The problem

The copilot was shipped. It worked. But production traces told a harder story: the average happy-path conversation was burning roughly 75,000 tokens before the contractor could finalise an order. Token consumption was growing faster than order volume, and the per-order inference cost was no longer justified by the per-order gross margin the feature was supposed to protect.

The model was not the problem. The stack was asking the model to do three jobs at once: reason about the next step, write the user-facing text, and decide every UI element that should render at every turn — which buttons, which map, which confirmation. Every one of those UI decisions travelled through the model as tool instructions, tool outputs, and retry loops. The bill was paying for orchestration theatre.

What we changed

We rebuilt the division of labour between the LLM and the surrounding application. Four structural changes, no model swap, no prompt-tuning tricks.
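As a rough illustration of what moving UI decisions out of the model can look like, the sketch below has application code choose the components for each step of the flow, while the model contributes only the user-facing text. This is a minimal sketch under assumptions: the step names, component names, and the `build_turn` helper are hypothetical and are not taken from the client system.

```python
from enum import Enum


class Step(Enum):
    # Hypothetical steps of a reorder flow (illustrative only).
    SELECT_ITEMS = "select_items"
    CONFIRM_ADDRESS = "confirm_address"
    REVIEW_ORDER = "review_order"


# Application-owned mapping: which UI components render at each step.
# The model never sees or decides this table.
UI_BY_STEP = {
    Step.SELECT_ITEMS: ["item_picker", "continue_button"],
    Step.CONFIRM_ADDRESS: ["map", "confirm_address_button"],
    Step.REVIEW_ORDER: ["order_summary", "finalise_button"],
}


def build_turn(step: Step, model_text: str) -> dict:
    """Combine model-written text with components chosen by code."""
    return {"text": model_text, "components": list(UI_BY_STEP[step])}


if __name__ == "__main__":
    print(build_turn(Step.CONFIRM_ADDRESS, "Is this the right delivery address?"))
```

With a table like this, the buttons and map for a given step cost no tokens to decide and cannot be skipped or invented by the model.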

The result

Compared like-for-like against equivalent pre-engagement happy-path traces:

How we measured it

We benchmarked on paired traces: one representative happy path run against the pre-engagement build, one equivalent happy path run against the post-engagement build, same staging backend, same reorder scenario. Token counts come from orchestration-layer observability, not from the model response. The 59% figure is from a clean follow-up run against a warmed-up build; reductions across a handful of representative paths fell in the 40–60% range.
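The paired-trace comparison reduces to simple arithmetic, sketched below. The `TracePair` type, the function names, and the token counts in the usage example are assumptions made for illustration; they are not the engagement's actual tooling or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TracePair:
    scenario: str
    tokens_before: int  # total tokens on the pre-engagement build
    tokens_after: int   # total tokens on the post-engagement build


def reduction_pct(pair: TracePair) -> float:
    """Percentage reduction in tokens for one paired trace."""
    if pair.tokens_before <= 0:
        raise ValueError("baseline token count must be positive")
    return 100.0 * (pair.tokens_before - pair.tokens_after) / pair.tokens_before


def summarise(pairs: list[TracePair]) -> dict:
    """Min, max and mean reduction across a set of paired traces."""
    reductions = [reduction_pct(p) for p in pairs]
    return {
        "min": min(reductions),
        "max": max(reductions),
        "mean": sum(reductions) / len(reductions),
    }


if __name__ == "__main__":
    # Made-up token counts, chosen only to exercise the arithmetic.
    pairs = [
        TracePair("scenario_a", 100_000, 41_000),
        TracePair("scenario_b", 80_000, 44_000),
    ]
    print(summarise(pairs))
```

Reporting the range alongside the best single run keeps a headline figure from standing in for the typical case.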

We also tracked the class of error we cared most about — UI fabrications, where the model either invented an ID or skipped rendering a required button — against production traces over a two-week window before and after. The class was present before the engagement. It stopped appearing after.
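A check for this error class can be sketched as a set comparison over each traced turn: any rendered ID that the backend does not know counts as invented, and any required button that was not rendered counts as skipped. The function name, argument shapes, and sample values below are hypothetical, assumed for illustration rather than drawn from the production traces.

```python
def find_ui_fabrications(
    rendered_ids: list[str],
    rendered_buttons: list[str],
    known_ids: set[str],
    required_buttons: set[str],
) -> dict:
    """Flag invented IDs and missing required buttons for one turn."""
    invented = sorted(set(rendered_ids) - known_ids)
    missing = sorted(required_buttons - set(rendered_buttons))
    return {"invented_ids": invented, "missing_buttons": missing}


if __name__ == "__main__":
    # Illustrative turn: one ID the backend never issued, one button skipped.
    report = find_ui_fabrications(
        rendered_ids=["order-17", "order-99"],
        rendered_buttons=["cancel"],
        known_ids={"order-17", "order-18"},
        required_buttons={"cancel", "confirm"},
    )
    print(report)
```

Running a check of this kind over every traced turn gives a count that can be compared across the before and after windows.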

What this means for your copilot

If your AI copilot walks a user through a structured operational flow — orders, dispatches, tickets, rosters, routes — and the monthly inference bill is outrunning the business case for the feature, there is a very good chance the same pattern applies to you. The LLM is almost certainly doing work the surrounding code should own, and paying for that work twice: once to instruct it, once to retry when it gets it wrong.

A scoping call is the right first step. We can tell you in an hour, from a representative trace and a sample of the system prompt, whether this pattern fits your stack and how large the reduction is likely to be.

FAQ

Did you change the model?

No. Same model family, same tier. Every number in this case study was measured on the original model. Model swaps are on the table later, once the structural waste is gone, but they are the second move, not the first.

Is a 59% reduction typical?

It’s at the high end of what we see, and it came from a clean follow-up run after the full change-set landed. The realistic band across a representative sample of happy paths is 40–60% on dispatcher-style copilots. On unusually prompt-heavy or tool-heavy setups we’ve seen larger reductions; on already-tight stacks, smaller.

How long did the engagement take?

Core changes landed in roughly four weeks. The longer tail — observability, evaluation harness, handover — ran a further four to six. Short, scoped, outcome-based; the client team operates the system now without us.

Can you do this on our stack without access to our source code?

No. Engagements require access to the orchestration layer — system prompt, tool definitions, and a representative sample of production traces. Read-only is fine for the audit phase; write access comes later if we agree on the change-set.

What’s the smallest version of this engagement?

A two-to-four-week audit with a prioritised punch list and a working proof-of-concept on one of the items. No long-term commitment. If you like the punch list, we do the implementation. If not, you keep the punch list.
