Concrete is the hardest operational workflow to get AI right on. Orders must be placed against real plants, real trucks, real drivers, real pour windows. A fabricated ticket ID or a wrong quantity isn't a "hallucination"; it's a logistics failure that cascades through a whole shift. We've taken a production AI copilot for a US concrete-dispatch SaaS from 75K tokens per happy path to under 26K, and eliminated the fabricated-ID failure class entirely. Same model.
Past-orders and order-detail tools return raw JSON with dozens of fields. The model re-reads and re-summarises that JSON, burning tokens on every subsequent turn. A thin server-side normaliser cuts that cost by ~75% per tool call.
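As a rough illustration of the normaliser idea, the sketch below keeps a handful of fields and renders each order as one compact line. The field names (`order_id`, `plant`, `quantity_yd3`, `pour_start`, `address`) and the `key=value` output format are assumptions made for this example, not the actual tool schema.

```python
from typing import Any

# Hypothetical whitelist: the real tool schema is not shown here.
KEEP_FIELDS = ("order_id", "plant", "quantity_yd3", "pour_start", "address")


def normalise_order(raw: dict[str, Any]) -> str:
    """Render one raw order record as a single compact line for the model."""
    parts = [f"{key}={raw[key]}" for key in KEEP_FIELDS if raw.get(key) is not None]
    return "; ".join(parts)


def normalise_orders(raw_orders: list[dict[str, Any]]) -> str:
    """Render a list of raw order records, one line per order."""
    return "\n".join(normalise_order(order) for order in raw_orders)
```

Everything outside the whitelist never reaches the model, so the saving grows with the number of fields the raw payload carries.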
UI element choice (button labels, option arrays, when to render the map) travels through the model as tool instructions and tool outputs. You pay for UI in tokens twice: once to instruct the model, and again to retry when it skips a required render. Moving UI emission into deterministic code saves tokens and makes fabricated IDs structurally impossible.
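One way to picture deterministic UI emission: buttons are built in code from database rows, and a click is resolved only against the buttons that were actually rendered. The `Button` type, the function names, and the row fields below are hypothetical, chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    label: str
    order_id: str  # copied from the database row, never from model text


def build_reorder_buttons(past_orders: list[dict]) -> list[Button]:
    """Build one button per past order, straight from stored rows."""
    return [
        Button(
            label=f"Reorder {order['quantity_yd3']} yd3 to {order['address']}",
            order_id=order["order_id"],
        )
        for order in past_orders
    ]


def resolve_click(buttons: list[Button], clicked_id: str) -> Button:
    """Accept only IDs that belong to a button that was actually rendered."""
    for button in buttons:
        if button.order_id == clicked_id:
            return button
    raise ValueError(f"unknown order id: {clicked_id!r}")
```

Because every ID on screen originates from a stored row and every click is checked against that set, an ID the model invented has no path into an order.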
Load size, truck spacing, quantity, address — all in the past-orders table. A well-designed reorder prefills all of them server-side in one step; a naive one turns them into four follow-up questions and four extra LLM turns.
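A minimal sketch of the one-step prefill, assuming the past-orders row is available as a dictionary. The column names are hypothetical; the point is that only fields genuinely missing from the row would need a follow-up question.

```python
# Hypothetical column names for the four fields mentioned above.
PREFILL_FIELDS = ("load_size_yd3", "truck_spacing_min", "quantity_yd3", "address")


def prefill_reorder(past_order: dict) -> tuple[dict, list[str]]:
    """Copy known fields into a draft order and list the ones still missing."""
    draft = {
        field: past_order[field]
        for field in PREFILL_FIELDS
        if past_order.get(field) is not None
    }
    missing = [field for field in PREFILL_FIELDS if field not in draft]
    return draft, missing
```

With a complete past order, `missing` is empty and the flow can go straight to confirmation without an extra LLM turn.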
Typical dispatch prompts spend 40–60% of their content on rules the surrounding application could enforce deterministically. Every turn pays for those rules in tokens. Stripping them is the fastest structural win.
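To make "enforce deterministically" concrete, here is a sketch of a validator that could stand in for a few prompt rules. The specific rules and thresholds (`MIN_LEAD_TIME`, `MAX_LOAD_YD3`) are invented for illustration and are not taken from any real dispatch system.

```python
from datetime import datetime, timedelta

MIN_LEAD_TIME = timedelta(hours=2)  # hypothetical business rule
MAX_LOAD_YD3 = 10.0                 # hypothetical truck capacity


def validate_order(
    quantity_yd3: float,
    load_size_yd3: float,
    pour_start: datetime,
    now: datetime,
) -> list[str]:
    """Return a list of rule violations; an empty list means the order passes."""
    errors = []
    if quantity_yd3 <= 0:
        errors.append("quantity must be positive")
    if not 0 < load_size_yd3 <= MAX_LOAD_YD3:
        errors.append(f"load size must be between 0 and {MAX_LOAD_YD3} yd3")
    if pour_start - now < MIN_LEAD_TIME:
        errors.append("pour window starts too soon")
    return errors
```

Rules checked this way cost no tokens per turn, and a violation produces the same message every time instead of depending on the model recalling the rule.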
Per-call token and cost observability so every regression is a query, not an investigation
Server-side state machine for the order / reorder / reschedule flow — LLM stops driving, code does
Tool-output normalisation layer — compact text replaces raw JSON, 3–5× fewer tokens per tool call
UI emission pulled out of the model — fabricated IDs and dropped buttons become structurally impossible
Eval harness that watches the class of errors that matter (fabricated IDs, quantity drift, wrong plant) across every release
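The server-side state machine from the list above could look roughly like the following sketch. The states and event names are hypothetical; the real flow for order, reorder, and reschedule will have its own steps.

```python
from enum import Enum, auto


class State(Enum):
    SELECT_ORDER = auto()
    CONFIRM_DETAILS = auto()
    PICK_WINDOW = auto()
    PLACED = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.SELECT_ORDER, "order_selected"): State.CONFIRM_DETAILS,
    (State.CONFIRM_DETAILS, "details_confirmed"): State.PICK_WINDOW,
    (State.CONFIRM_DETAILS, "details_rejected"): State.SELECT_ORDER,
    (State.PICK_WINDOW, "window_chosen"): State.PLACED,
}


def advance(state: State, event: str) -> State:
    """Move to the next state, rejecting any event not allowed in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.name}"
        ) from None
```

In this arrangement the model's job shrinks to mapping user text onto one of the allowed events, while the code decides what happens next.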