What does a production AI agent stack include?

Six layers: a model tier (often multiple models routed by task), orchestration (state machines or agentic loops), retrieval over your own data, an eval suite run on every change, observability (logging, cost, drift), and guardrails (bounded permissions, human escalation, rollback). A demo has the first layer; production needs all six.

Which LLM is best for AI agents?

That is the wrong question: production systems route by task. Use frontier models where judgment quality is the binding constraint and smaller models where latency and cost dominate, and evaluate each against your own test set rather than public benchmarks. Design the system so that a model swap is a config change, because vendors ship breaking changes regularly.

Why do AI agents fail in production?

Silent drift after vendor model updates, compounding errors across multi-step workflows, unhandled low-confidence states, and cost blowouts at volume. All four are engineering problems solved by evals, bounded autonomy, escalation paths, and per-run instrumentation — not by better prompts.

The production AI agent stack: what we actually deploy

Every production agent we ship has the same six layers: a model tier, orchestration, retrieval, evals, observability, and guardrails. The model gets all the attention; the other five decide whether the system survives contact with real work. Demos have one layer. Production has six.

Models are a portfolio decision, not an allegiance. Frontier models where judgment quality is the constraint; smaller, faster models for classification and routing where latency and unit cost dominate. We're deliberately vendor-neutral — the stack is designed so a model swap is a config change plus an eval run, not a rewrite. When a vendor ships a breaking change (they do, regularly), that neutrality is what turns a fire drill into a Tuesday.
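To make "a model swap is a config change" concrete, here is a minimal sketch of task-based routing. Everything in it is an assumption for illustration: the task names, the placeholder model names, the `choose_model` and `swap_model` helpers, and the choice to send unknown tasks to the frontier tier. It is not a description of any particular vendor's API.

```python
from dataclasses import dataclass

# Hypothetical routing table: task type -> model tier.
# The model names are placeholders, not real vendor identifiers.
ROUTING_CONFIG = {
    "classification": "small-fast-model",
    "routing": "small-fast-model",
    "contract_review": "frontier-model",
}

# Assumption: tasks we have not classified go to the stronger model.
DEFAULT_MODEL = "frontier-model"


@dataclass(frozen=True)
class ModelChoice:
    task: str
    model: str


def choose_model(task: str, config: dict[str, str] = ROUTING_CONFIG) -> ModelChoice:
    """Look up the model for a task; unknown tasks fall back to the default."""
    return ModelChoice(task=task, model=config.get(task, DEFAULT_MODEL))


def swap_model(config: dict[str, str], old: str, new: str) -> dict[str, str]:
    """Return a new config with every use of `old` replaced by `new`."""
    return {task: (new if model == old else model) for task, model in config.items()}
```

Because application code only ever calls `choose_model`, replacing a model touches the table and nothing else. The swap would still need an eval run before it ships.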

Orchestration is where the workflow lives: state machines for processes with defined steps and approval gates, agentic loops only where genuine judgment is required. The unfashionable truth is that most business workflows want *less* autonomy than the demos suggest — a reliable eight-step process with two human checkpoints beats an impressive free-roaming agent that's right 80% of the time.
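A state machine with approval gates can be sketched in a few lines. This is a shortened, hypothetical workflow (five steps, two human checkpoints) rather than a real eight-step process; the step names, the `Workflow` class, and its methods are all invented for illustration.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    WAITING_FOR_APPROVAL = "waiting_for_approval"
    DONE = "done"


# Hypothetical workflow: ordered steps, two of which need human sign-off.
STEPS = ["intake", "extract", "draft", "review", "send"]
APPROVAL_GATES = {"review", "send"}


class Workflow:
    def __init__(self, steps, gates):
        self.steps = list(steps)
        self.gates = set(gates)
        self.index = 0
        self.status = Status.RUNNING
        self.history = []

    def advance(self):
        """Run steps in order until a gated step or the end is reached."""
        while self.status is Status.RUNNING:
            if self.index >= len(self.steps):
                self.status = Status.DONE
                break
            step = self.steps[self.index]
            if step in self.gates:
                # Stop here: a human must call approve() before we continue.
                self.status = Status.WAITING_FOR_APPROVAL
                break
            self.history.append(step)
            self.index += 1
        return self.status

    def approve(self):
        """Record human approval of the gated step, then resume the workflow."""
        if self.status is not Status.WAITING_FOR_APPROVAL:
            raise RuntimeError("no step is waiting for approval")
        self.history.append(self.steps[self.index] + " (approved)")
        self.index += 1
        self.status = Status.RUNNING
        return self.advance()
```

The point of the design is that the agent cannot pass a gate on its own: the only way past `review` or `send` is an explicit `approve()` call.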

Retrieval grounds the agent in your reality: your docs, your tickets, your data, fetched at answer time instead of memorized at training time. This is where data readiness shows up in the budget — clean sources make this layer a week; inboxes and PDFs make it a month.
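"Fetched at answer time" means the sources are looked up per question and placed in the prompt. The sketch below shows the shape of that step using naive keyword overlap as a stand-in for a real search index or embedding store. The corpus, the `retrieve` and `build_prompt` functions, and the prompt wording are all hypothetical.

```python
import re

# Hypothetical in-memory corpus standing in for your docs and tickets.
DOCUMENTS = {
    "refund-policy": "A refund is issued within 14 days of a return request.",
    "shipping-faq": "Standard shipping takes 3 to 5 business days.",
    "ticket-1042": "Customer asked about a refund for a damaged item.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Rank documents by shared words with the query (a toy scoring rule)."""
    query_tokens = tokenize(query)
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_tokens & tokenize(text))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def build_prompt(query, documents, top_k=2):
    """Assemble a prompt that cites the retrieved sources by id."""
    ids = retrieve(query, documents, top_k)
    context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in ids)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"
```

A real deployment would replace the scoring rule, but the contract stays the same: the model only sees what retrieval hands it, which is why messy sources cost time at this layer.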

Evals are the layer that separates firms that operate AI from firms that demo it. Before an agent ships, we build a scored test set from real cases — the awkward ones included — and every change afterward (prompt, model, retrieval) runs against it. No green run, no deploy. It's the same discipline we described in the five gates, and it's the reason 'the model got worse last week' becomes a measurable event instead of a vibe.
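"No green run, no deploy" can be sketched as a small gate function. The cases, the exact-match scoring rule, the 100% threshold, and the stub agent below are assumptions chosen to keep the example short; a real suite would score real cases with whatever rubric fits the task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    expected: str


# Hypothetical test set; in practice these come from real cases.
CASES = [
    EvalCase("simple-refund", "Route: I want my money back", "billing"),
    EvalCase("awkward-mixed", "Route: refund, and also my login is broken", "escalate"),
]


def stub_agent(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "escalate" if " and " in prompt else "billing"


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Score the agent by exact match (an assumed, deliberately strict rule)."""
    failures = [
        case.name
        for case in cases
        if agent(case.prompt).strip().lower() != case.expected.lower()
    ]
    passed = len(cases) - len(failures)
    score = passed / len(cases) if cases else 0.0
    return {"score": score, "failures": failures}


def can_deploy(report: dict, threshold: float = 1.0) -> bool:
    """Block the deploy unless the score meets the threshold."""
    return report["score"] >= threshold
```

Run against every prompt, model, or retrieval change, the report also names which cases regressed, which is what makes a quality drop a measurable event.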

Observability and guardrails close the loop. On the observability side, every action is logged with full context, cost and latency are tracked per run, and drift is surfaced on dashboards. On the guardrail side, permissions are bounded per step, low-confidence states escalate to humans, and rollback paths are rehearsed. This layer is most of what a Managed AI Operations retainer does all day, because agents don't fail loudly. They fail quietly, and only instrumentation catches quiet failures.
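The two halves of this layer meet in a single wrapper around each action: check the permission, check the confidence, and log the result either way. The sketch below assumes a hypothetical permission table, an arbitrary confidence floor of 0.8, and a plain list as the log sink; the names `guarded_action` and `RunRecord` are invented for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field

# Hypothetical per-step permissions: what each step is allowed to do.
STEP_PERMISSIONS = {
    "draft_reply": {"read_ticket"},
    "issue_refund": {"read_ticket", "write_payment"},
}

# Assumed threshold; below it the action goes to a human instead of running.
CONFIDENCE_FLOOR = 0.8


@dataclass
class RunRecord:
    step: str
    action: str
    confidence: float
    cost_usd: float
    latency_s: float
    outcome: str
    timestamp: float = field(default_factory=time.time)


def guarded_action(step, action, confidence, cost_usd, latency_s, log):
    """Decide whether an action may run, and log the decision with context."""
    if action not in STEP_PERMISSIONS.get(step, set()):
        outcome = "blocked"      # outside this step's bounded permissions
    elif confidence < CONFIDENCE_FLOOR:
        outcome = "escalated"    # low confidence: hand off to a human
    else:
        outcome = "executed"
    record = RunRecord(step, action, confidence, cost_usd, latency_s, outcome)
    log.append(json.dumps(asdict(record)))  # one structured line per action
    return outcome
```

Blocked and escalated actions are logged just like executed ones, so a rising escalation rate or a creeping cost per run shows up in the data before anyone reports a problem.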

If you're scoping your first build, the stack above is what 'done' should mean in the statement of work — it's what we deploy in a Gigabit Agents build, and it's why the flat fee includes evals and observability rather than treating them as change orders. Anything less isn't a production agent; it's a demo with your logo on it.

Related insights

Why your AI pilot never reached production — and the five gates that get it there

How much does an AI agent cost in 2026? A real budget breakdown

Build vs. buy AI agents: a decision guide for operators
