Every production agent we ship has the same six layers: a model tier, orchestration, retrieval, evals, observability, and guardrails. The model gets all the attention; the other five decide whether the system survives contact with real work. Demos have one layer. Production has six.
Models are a portfolio decision, not an allegiance. Frontier models where judgment quality is the constraint; smaller, faster models for classification and routing where latency and unit cost dominate. We're deliberately vendor-neutral — the stack is designed so a model swap is a config change plus an eval run, not a rewrite. When a vendor ships a breaking change (they do, regularly), that neutrality is what turns a fire drill into a Tuesday.
Orchestration is where the workflow lives: state machines for processes with defined steps and approval gates, agentic loops only where genuine judgment is required. The unfashionable truth is that most business workflows want *less* autonomy than the demos suggest — a reliable eight-step process with two human checkpoints beats an impressive free-roaming agent that's right 80% of the time.
Retrieval grounds the agent in your reality: your docs, your tickets, your data, fetched at answer time instead of memorized at training time. This is where data readiness shows up in the budget — clean sources make this layer a week; inboxes and PDFs make it a month.
Evals are the layer that separates firms that operate AI from firms that demo it. Before an agent ships, we build a scored test set from real cases — the awkward ones included — and every change afterward (prompt, model, retrieval) runs against it. No green run, no deploy. It's the same discipline we described in the five gates, and it's the reason 'the model got worse last week' becomes a measurable event instead of a vibe.
Observability and guardrails close the loop: every action logged with full context, cost and latency tracked per run, drift surfaced on dashboards; permissions bounded per step, low-confidence states escalating to humans, rollback paths rehearsed. This layer is most of what a Managed AI Operations retainer does all day — because agents don't fail loudly, they fail quietly, and only instrumentation notices quietly.
If you're scoping your first build, the stack above is what 'done' should mean in the statement of work — it's what we deploy in a First Agent Deployment, and it's why the fixed price includes evals and observability rather than treating them as change orders. Anything less isn't a production agent; it's a demo with your logo on it.


