AI Engineer Conference Talk: The Production AI Playbook

Today: The five pillars that separate production AI from expensive pilots, the actual judge prompt we use for evaluation, and a banking case study where the model wasn't picked until week seven.

Jun 20, 2026

I recorded a talk at the AI Engineer Conference in London in April 2026, and I’m sharing the full recording.

A quick note before you watch: this was recorded in April 2026, and the pace of change in this space means a few things have already moved on. Databricks has since shipped more platform features that make some of these pillars easier to implement than I describe in the talk. My own thinking keeps shifting too, as I talk to more customers and fold new patterns into the framework. I’ll keep sharing those updates here and on the YouTube channel as they land.

Here’s what it covers.

The talk is called “The Production AI Playbook” and the core argument is this: most AI projects fail for the same five reasons, and almost none of them are about the model.

I open with a pattern most of you will recognise: the Week 1 to 14 doom loop. Teams pick a model, build features, demo to leadership, ship, and watch it fall apart within weeks because nobody built the infrastructure underneath it. Gartner puts enterprise AI project failure above 40%. I’ve watched it happen first-hand, more than once.

From there I walk through five pillars, each one a dependency for the next:
Evaluation,
Observability,
Data Foundation,
Orchestration,
Governance.

Two pillars get particular depth. On evaluation, I show the actual judge prompt structure we use for LLM-as-judge scoring, including the fix for non-determinism: running each test case three times and flagging anything with high score variance before it ships. On orchestration, I cover the patterns that hold up in production against the ones that only survive in demos, including failure modes nobody puts on slides: context window bleed, cascading failures, and trust boundary violations between agents.
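To make the non-determinism fix concrete, here is a minimal sketch of the repeat-and-flag idea. It is not the judge setup from the talk: the function names, the numeric score scale, and the `max_stdev` threshold of 0.5 are all assumptions for illustration, and the judge is a scripted stub standing in for a real LLM-as-judge call.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def flag_unstable_cases(
    test_cases: List[str],
    judge: Callable[[str], float],
    runs: int = 3,            # the talk describes three runs per test case
    max_stdev: float = 0.5,   # assumed threshold; tune to your own score scale
) -> Dict[str, dict]:
    """Score each test case `runs` times and flag high score variance."""
    report = {}
    for case in test_cases:
        scores = [judge(case) for _ in range(runs)]
        spread = pstdev(scores)  # population standard deviation of the runs
        report[case] = {
            "scores": scores,
            "mean": mean(scores),
            "stdev": spread,
            "flagged": spread > max_stdev,
        }
    return report


def make_stub_judge(scripted: Dict[str, List[float]]) -> Callable[[str], float]:
    """Hypothetical stand-in for a judge call: replays scripted scores."""
    iterators = {case: iter(scores) for case, scores in scripted.items()}

    def judge(case: str) -> float:
        return next(iterators[case])

    return judge


if __name__ == "__main__":
    judge = make_stub_judge({
        "refund policy question": [4, 4, 4],  # stable across runs
        "card dispute question": [1, 5, 3],   # unstable across runs
    })
    cases = ["refund policy question", "card dispute question"]
    for case, result in flag_unstable_cases(cases, judge).items():
        status = "REVIEW BEFORE SHIPPING" if result["flagged"] else "ok"
        print(f"{case}: scores={result['scores']} -> {status}")
```

In practice you would swap the stub for your real judge call and treat a flagged case as a reason to inspect the judge prompt or the test case, not as a verdict on the system under test.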

The spine of the talk is a real case study: a retail bank, 18,000 calls a month, a prior attempt that burned $85,000 over six months with nothing shipped. The second attempt took eight weeks, and the model wasn’t chosen until week seven. Everything before that was evaluation and infrastructure. The result: 87% accuracy, 62% call deflection, and a tripled API-call bug caught in two hours that would otherwise have cost $43,200 a year in wasted fees.

I also include a section on what I’d do differently. Three things that surprised us even after the framework held up: the test case library needs a named owner or it quietly rots, prompt version logs need to capture intent and not just the diff, and behavioural evals cost far more to run at scale than most teams budget for.

If you’re building or running production AI, this one will be useful.

Here’s the link to the resources I talk about in the video.

I also gave an online talk at the same conference on Multi-Agent Orchestration Patterns, a more technical deep dive into the patterns and failure modes you hit when you build agentic workflows.

Talk soon,
Sandi

P.S. If you’re new here - welcome 🎉. AgentBuild is a community of practitioners working through the real challenges of getting AI into production inside large organisations. Every week I share practical, grounded thinking from the people doing this work at the sharp end. The goal is never theory - it’s always: what can you use Monday morning.

Ask your friends to join.
More valuable content coming your way.

agentbuild.ai
