agentbuild.ai

The Evaluation Graph: Why Your AI Pipelines Are Lying to You

Sandipan Bhaumik — Sat, 02 May 2026 13:31:15 GMT

Here is a pattern I have seen more times than I can count.

A team deploys an AI system into production. It passes every evaluation they ran. Accuracy looked good. The stakeholder demo went well. The pilot was declared a success. Three months later, the system is quietly shelved because the outputs no longer make sense - or worse, they never did, and nobody caught it until real users started complaining.

When I dig into what went wrong, I often find that the shape of the evaluation itself is what led to low-quality agentic decisions.

The teams running linear eval pipelines - input goes in, score comes out - are measuring a snapshot of a moment. They are not measuring how their system behaves as context shifts, as data drifts, as agents hand off to other agents, as the real world does what the real world always does. They are measuring a straight line. Their system is a graph.

That mismatch is why so many AI evaluations feel thorough and turn out to be worthless.

Pipelines vs Graphs

The word ‘pipeline’ is everywhere in AI engineering. Data pipelines, inference pipelines, eval pipelines. We’ve adopted it as the default mental model for how AI systems work.

And for a lot of data engineering, it’s correct. Data flows in one direction. You extract, you transform, you load. A pipeline is a clean metaphor because data really does flow like water through a pipe.

But AI systems in production - especially multi-agent systems, RAG architectures, and anything that has to maintain context across multiple turns or tool calls - do not behave like pipelines. They behave like graphs. There are loops. There are conditional branches. There are nodes that depend on the state of other nodes that were resolved two steps earlier. Context that was established at step one can poison or distort the output at step seven.

When you evaluate a graph as if it were a pipeline, you get a false sense of confidence. You test the happy path. You test the input-output pair. You miss the edges. You miss the feedback loops. You miss the context that has been accumulating and silently corrupting your system’s reasoning.

I’ve started calling this context drift - the phenomenon where a system’s outputs degrade because the context it’s operating in has shifted in ways your evaluations weren’t designed to detect. A pipeline eval can’t catch context drift. Only a graph-shaped evaluation can.

What is an Evaluation Graph?

The Evaluation Graph is not a tool or a framework you install. It’s a mental model - a different way of thinking about what you’re actually evaluating and when.

In a pipeline eval, you define a set of test cases, run your system against them, and score the outputs. Done. Repeatable. Clean.

Evaluation Graph Concept - Generated by Author

In an evaluation graph, you map out the nodes of your system - the points where decisions are made, where context is retrieved, where agents hand off to each other, where state is read or written - and you evaluate at each node, not just at the final output.

Here is what that changes in practice.

First, you gain localised failure detection. When a pipeline eval fails, you know something went wrong. You don’t know where. When a graph eval fails, you know exactly which node broke down - was it the retrieval? The reranker? The summarisation step? The router that decided which agent to call? You can fix what’s actually broken instead of rerunning the whole system hoping for different results.
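A minimal sketch of what localised failure detection might look like, assuming each node's output has been recorded during a run. The node names, the shape of the recorded run and the check functions are all hypothetical, made up for illustration:

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical: a check takes one node's recorded output and returns True if it looks acceptable.
NodeCheck = Callable[[dict], bool]


def evaluate_nodes(run: Dict[str, dict], checks: Dict[str, NodeCheck]) -> List[Tuple[str, bool]]:
    """Run each node's check against that node's recorded output, in graph order."""
    return [(node, check(run[node])) for node, check in checks.items()]


def first_failure(results: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the name of the first node whose check failed, or None if all passed."""
    for node, passed in results:
        if not passed:
            return node
    return None


# Recorded outputs from one run of an assumed retrieval -> reranker -> summariser flow.
run = {
    "retrieval": {"docs": ["policy_v2.pdf"], "expected_doc": "policy_v3.pdf"},
    "reranker": {"top_score": 0.91},
    "summariser": {"summary": "Refunds are allowed within 30 days."},
}
checks = {
    "retrieval": lambda out: out["expected_doc"] in out["docs"],
    "reranker": lambda out: out["top_score"] >= 0.5,
    "summariser": lambda out: len(out["summary"]) > 0,
}

results = evaluate_nodes(run, checks)
print(first_failure(results))
```

Here the final summary looks plausible on its own, but the per-node results point at retrieval as the step that broke.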

Second, you can evaluate context propagation. I have seen this skipped many times. It’s not enough to evaluate whether each node produces a good output given its input. You need to evaluate whether the context being passed between nodes is coherent, relevant, and not accumulating noise. I’ve seen systems where individual components all scored above 90% in isolation, but the system as a whole produced nonsense because each node was passing slightly degraded context to the next one. No pipeline eval would catch that.
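A rough piece of arithmetic shows why this compounds. It assumes, purely for illustration, that each hop independently preserves 90% of the context quality it receives and that the losses multiply. Real systems will not degrade this neatly:

```python
def end_to_end_quality(per_hop_quality: float, hops: int) -> float:
    """Quality left after `hops` hand-offs, assuming independent multiplicative loss."""
    return per_hop_quality ** hops


for hops in (1, 3, 5, 7):
    print(hops, round(end_to_end_quality(0.9, hops), 3))
```

Under that assumption, seven hops at 90% each leave a little under half of the original quality, even though every node looks healthy in isolation.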

Third, you can evaluate decision boundaries. Multi-agent systems have routing logic - conditions that decide which agent runs next, or whether to escalate, or whether to call a tool. These decision boundaries are often the most fragile part of a production AI system, and they’re almost never tested explicitly. In an evaluation graph, they are nodes. They get evaluated just like everything else.
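As a sketch, treating a router as a node means writing test cases that sit right on its decision boundaries. The routing rule, the 500 threshold and the agent names below are invented for the example:

```python
def route(request: dict) -> str:
    """Hypothetical router: escalate large refunds, otherwise pick an agent by intent."""
    if request["intent"] == "refund" and request["amount"] > 500:
        return "human_escalation"
    if request["intent"] == "refund":
        return "refund_agent"
    return "general_agent"


# Cases chosen to sit on either side of each boundary, not just the happy path.
boundary_cases = [
    ({"intent": "refund", "amount": 500}, "refund_agent"),
    ({"intent": "refund", "amount": 501}, "human_escalation"),
    ({"intent": "billing", "amount": 9999}, "general_agent"),
]

failures = [
    (request, expected, route(request))
    for request, expected in boundary_cases
    if route(request) != expected
]
print(failures)
```

An empty failure list says the boundaries behave as specified for these cases; it says nothing about boundaries you did not write down.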

How to Build One

Starting with the evaluation graph doesn’t require you to throw away your existing evals. It requires you to extend them in a specific direction.

The first step is decomposition. Draw out your system - literally, on a whiteboard or in a diagram - and identify every point where a meaningful decision is made or meaningful state changes. Each of those points is a node. Each connection between nodes is an edge. What you’re drawing is the evaluation graph. Most teams are surprised by how many nodes they find that they’ve never evaluated.
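The whiteboard drawing can be written down as a plain adjacency map. This sketch assumes a small hypothetical system and a hypothetical list of nodes that already have evals, then lists the ones that don't:

```python
# Hypothetical system graph: each node maps to the nodes it hands off to.
edges = {
    "router": ["retrieval", "tool_call"],
    "retrieval": ["reranker"],
    "reranker": ["summariser"],
    "tool_call": ["summariser"],
    "summariser": [],
}

# Nodes that already have an eval today (assumed for the example).
evaluated = {"retrieval", "summariser"}

all_nodes = set(edges) | {dst for targets in edges.values() for dst in targets}
unevaluated = sorted(all_nodes - evaluated)
edge_list = [(src, dst) for src, targets in edges.items() for dst in targets]

print("nodes without evals:", unevaluated)
print("edges to map context for:", len(edge_list))
```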

The second step is context mapping. For each edge in the graph, define what context is being passed from one node to the next. What does the downstream node need to function correctly? What could the upstream node pass that would corrupt the downstream output? These become your edge-level test cases - not just input-output pairs, but context-propagation scenarios.
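One way to turn that into an edge-level test is a small contract per edge. The contract fields, the size limit and the context keys here are assumptions for illustration, not a standard:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EdgeContract:
    src: str
    dst: str
    required_keys: List[str]   # what the downstream node needs to function
    max_context_chars: int     # crude guard against accumulating noise


def check_edge(contract: EdgeContract, context: Dict[str, object]) -> List[str]:
    """Return a list of problems with the context passed along this edge."""
    problems = []
    for key in contract.required_keys:
        if key not in context or context[key] in (None, "", []):
            problems.append(f"missing:{key}")
    size = sum(len(str(value)) for value in context.values())
    if size > contract.max_context_chars:
        problems.append("context_too_large")
    return problems


contract = EdgeContract("retrieval", "summariser", ["query", "documents"], max_context_chars=2000)
good = {"query": "refund policy", "documents": ["Refunds take 30 days."]}
bad = {"query": "refund policy", "documents": []}

print(check_edge(contract, good))
print(check_edge(contract, bad))
```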

The third step is failure mode enumeration. For each node, ask: what does this node look like when it’s failing quietly? Not failing loudly - that’s easy to catch. Quiet failures are the dangerous ones. A retrieval node that returns plausible but wrong documents. A router that sends requests to the wrong agent 15% of the time. A summarisation step that subtly omits the most important information. These failure modes need to be in your evaluation suite explicitly. If they’re not, you won’t find them until a user does.
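A quiet failure like the misrouting router only shows up if you measure it against labelled expectations. A minimal sketch, with an invented labelled set and an assumed 5% tolerance:

```python
from typing import List


def misroute_rate(predicted: List[str], expected: List[str]) -> float:
    """Fraction of requests sent to a different agent than the labelled one."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and the same length")
    wrong = sum(p != e for p, e in zip(predicted, expected))
    return wrong / len(expected)


# Hypothetical labelled set: 20 requests, 3 of which the router sends to the wrong agent.
expected = ["refund_agent"] * 17 + ["general_agent"] * 3
predicted = ["refund_agent"] * 20

rate = misroute_rate(predicted, expected)
threshold = 0.05  # assumed tolerance; pick one that matches your own risk appetite
print(rate, rate > threshold)
```

Nothing crashes and every request gets an answer, which is exactly why this failure stays quiet until someone counts.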

The fourth step, and this is where graph-shaped evaluation really separates from pipeline evaluation, is composing your node-level evals into end-to-end scenarios that test the interaction effects. Not just ‘does node A work’ and ‘does node B work’, but ‘when node A produces this class of output, does node B degrade in a predictable way’. The interactions between nodes are often where production AI systems fail.
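An interaction scenario can be sketched by feeding node B several classes of node A output and checking what survives. The `summarise` function below is a deliberately naive stand-in for a real summarisation node, and the scenario names are made up:

```python
from typing import Dict, List, Tuple


def summarise(docs: List[str]) -> str:
    """Stand-in for node B: keep only the first sentence of each document."""
    return " ".join(d.split(".")[0].strip() + "." for d in docs if d.strip())


# Classes of output node A (retrieval) might produce, each paired with the fact B must keep.
scenarios: Dict[str, Tuple[List[str], str]] = {
    "clean": (["Refunds take 30 days. Contact support."], "30 days"),
    "noisy_prefix": (["Welcome to our site. Refunds take 30 days."], "30 days"),
    "empty": ([], "30 days"),
}

interaction_report = {
    name: must_contain in summarise(docs)
    for name, (docs, must_contain) in scenarios.items()
}
print(interaction_report)
```

Both nodes could pass their own evals here; it is the pairing of a noisy retrieval result with a first-sentence summariser that loses the important fact.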

This Needs a Shift in Mindset

The teams that build AI systems that hold up in production are building the most rigorous evaluation infrastructure. And rigorous evaluation infrastructure starts with a simple question: is my evaluation shaped like my system?

If your system is a graph and your evaluations are pipelines, you have a gap. That gap is where production failures live.

The evaluation graph is not a perfect solution - no evaluation framework is. Context still drifts in ways you won’t anticipate. Failure modes you didn’t enumerate will still appear. But it gets you structurally closer to what’s actually happening in your system, and that’s the difference between catching problems in staging and catching them after a customer has seen them.

This is one of the core concepts I’m currently working on. If it resonates with what you’re seeing in your own work, I’d genuinely like to hear about it. Hit reply and tell me what you see. The patterns you share inform what I write next.

Talk soon,
Sandi

👉 I wrote more about the Eval Graphs in my article on Atlan’s community substack.

Context & Chaos

Context Graphs as AI Evaluation Infrastructure

About the Author: Sandipan Bhaumik has spent almost 2 decades building Data & AI foundations. Now, through AgentBuild Weekly, he shares how builders and founders can move beyond AI hype to create Agentic systems that think, adapt, and truly work…


P.S. If you’re new here - welcome 🎉. AgentBuild is a community of practitioners working through the real challenges of getting AI into production inside large organisations. Every week I share practical, grounded thinking from the people doing this work at the sharp end. The goal is never theory - it’s always: what can you use Monday morning.

Ask your friends to join.

More valuable content coming your way.


Why Solution Architects Are the Real Force Behind Enterprise AI Transformation

Sun, 26 Apr 2026 10:43:30 GMT

There’s a role inside every enterprise AI programme that nobody has a clean job title for. It isn’t the VP who sponsors the initiative. It isn’t the data scientist who builds the model. It isn’t the product manager who writes the requirements.

It’s the person who gets pulled into the room when the demo worked brilliantly and the deployment didn’t. The person who has to figure out why a system that impressed everyone in the boardroom is now sitting in a security review queue with no clear owner, no evaluation criteria, and a go-live deadline nobody wants to move.

That person is usually a Solutions Architect.

And in this new world of AI transformation, architects lack the frameworks to match the responsibility they’ve been handed.

What the role has become

I’ve spent eighteen years in enterprise data and AI. The last several watching what happens when organizations decide to take AI seriously.

Here’s what I keep seeing: Solutions Architects are becoming the load-bearing wall of AI transformation programmes. By default.

They’re the ones who understand both the technology and the business context. They’re trusted enough to sit in executive sessions and technical ones. They have enough credibility to push back on vendor claims and enough pragmatism to know what actually ships.

So they get handed things. Big things.

Define the production readiness criteria. Assess whether the data infrastructure can support this use case. Figure out who owns the outcome when the model is wrong. Translate what the VP wants into something the engineering team can build. Get security and compliance aligned before the launch date nobody will move.

That’s not an implementation role. That’s an organizational diagnostic role. And most architects are not prepared for it.

The gap nobody names

The architects I see are not struggling because they can’t build. They can build. They’re struggling because the job has shifted from build to diagnose, and they don’t yet have the instruments for it.

When a doctor walks into a room, they’re not improvising. They have a diagnostic protocol. Repeatable questions. Known patterns. A framework that tells them what to look for and in what order, so they can tell the difference between something that needs immediate intervention and something that needs monitoring.

Right now, most architects walking into an AI programme are improvising. Drawing on instinct built from past projects. Pattern-matching against things they’ve seen before, hoping the pattern holds.

Sometimes it does. Often it doesn’t.

And when it doesn’t, the cost isn’t just the failed project. It’s the six months of organizational trust that went with it. The next AI initiative that’s three times harder to fund because this one didn’t ship. The architect who now has a complicated story to tell about why the thing they led didn’t work.

What real preparation looks like

I’ve been thinking for a long time about what it would mean to give architects the diagnostic tools they actually need. Something closer to a practitioner’s handbook for the organizational side of AI deployment. Not a vendor comparison or a tutorial on which framework to use.

The kind of resource that helps you walk into an early-stage AI programme and ask the right questions before anyone starts building. That gives you a structured way to identify where the real risk is - not the model risk, but the Data Debt sitting in pipelines that haven’t been touched in three years. The Decision Debt in an organization where nobody has agreed on who owns an AI error. The Evaluation Debt in a team that’s been running vibe checks and calling it validation.

The kind of resource that helps you have the conversation with the VP that reframes the whole initiative - not as a technology project, but as an organizational readiness problem that happens to have a technology solution.

That’s the conversation that changes outcomes. And most architects don’t have a framework for it yet.

Why I’m spending time on this

I’ve watched enough of these programmes - close enough to see the failure modes in detail - that the patterns are starting to feel predictable. Which means these are preventable.

I can walk into a kickoff meeting now and have a reasonable sense of what’s going to go wrong six months later. Not because I’m smarter than anyone in the room. Because I’ve seen it before. Enough times that it’s stopped feeling like bad luck and started feeling like a diagnostic problem with a known set of causes.

What I want to do - what I’m actively working on - is make that pattern recognition transferable. To give architects the frameworks that took me years of seeing things go wrong to develop, so they don’t have to learn the same lessons at the same cost.

That’s the work I’m orienting around. That’s the shape I want to give this community.

If you’re an architect who’s been handed one of these programmes - or knows you’re about to be - I’d genuinely like to hear what’s hard about it right now.

Talk soon,
Sandi


Thanks for reading agentbuild.ai! Subscribe for free to receive new posts and support my work.

Decision Traces: The Missing Black Box ✈️ for AI Agents

Sat, 18 Apr 2026 13:31:15 GMT

Hey everyone,

Hope you all are doing well. Today, I am bringing up something that is increasingly coming up in my customer discussions. Not everyone is giving it a name, but the requirements they define clearly point to building Decision Traces.

Let me explain.

Flight data recorders (FDR) are popularly known as the black box of the aircraft. Technically, the black box contains the FDR and the Cockpit Voice Recorder (CVR). The FDR turns raw sensor traces into timelines that explain crashes, reveal root causes, and drive global aviation safety improvements. Before flight data recorders were mandated, when something went wrong, investigators worked from witness accounts, wreckage patterns, and whatever instruments happened to be installed at the time. The analysis was mostly incomplete, often contradictory, and it rarely led to systemic changes.

The aviation industry knew flying was becoming more consequential, with more routes, more passengers, and more complex airspace, but the infrastructure to understand why things failed hadn’t kept pace with the deployment of the systems themselves. The flight data recorder was invented in response to the recognition that consequential systems operating at scale need a structured record of how they reached their outcomes - not just the outcomes themselves.

AI agents are at the same inflection point today. And the industry, especially in the regulated space, is recognizing that.

What’s the gap right now?

An AI agent deployed in a production system today typically produces two things: an input log and an output. What it doesn’t produce is the reasoning chain between them in any structured, queryable, auditable way.

This gap has a name. It’s called Decision Debt, one of the three categories of debt that block AI from working in production. Decision Debt accumulates when you build and deploy AI systems before defining how decisions get made, recorded, and reviewed. It’s not a future problem. It’s accumulating now, in every agent deployment that ships without trace infrastructure.

A decision trace is the record of how an agent got from context to conclusion: what it knew, what it considered, what it weighted, what it discarded, and at what confidence level it committed to an action.

IMPORTANT: It’s not a log file. Logs capture events. Traces capture reasoning.

The distinction matters because when something goes wrong - and in any system operating at scale, something will - you need to answer a different set of questions than a log can address.

Answering “what happened” is not enough. “Why did the agent conclude that, given what it had access to?” matters more.

This is a historical pattern

Aviation got to the black box through painful iteration. Financial services got to trade surveillance infrastructure the same way. Healthcare built clinical decision support audit trails only after near-misses forced the question.

The pattern across every regulated industry is identical: consequential system deploys, operates without adequate observability, incident occurs, retroactive audit reveals the trace infrastructure was never built, expensive fixes follow.

What’s different with AI agents is that we can see this pattern coming before the incidents accumulate. The decision trace problem is visible now, in advance, to anyone who has watched the previous cycles play out in adjacent domains.

Nuclear power operations built decision logging infrastructure into control room design before widespread deployment. And that’s not because regulators demanded it initially, but because the engineers understood that a system making consequential decisions in real time needed to be interrogable after the fact.

The Chernobyl investigation was partially possible because operator actions were timestamped and sequenced. The lessons extracted shaped reactor design globally.

The equivalent for AI agents isn’t complicated in principle.

It is, however, work that almost nobody has started.

What does Decision Trace Infrastructure actually need?

The architecture for a decision trace system has five functional layers, and each one has a specific job. Here’s how they fit together.

Illustration of a Decision Trace Pipeline

Input Capture Service is where the trace begins - at the moment a request enters the system. Query, user identity, session context, and request metadata are captured here, backed by a metadata store (PostgreSQL can be a straightforward choice). This is the “who asked what, when, and from where” layer. Without it, you have no anchor for the rest of the trace.
State Retrieval and Context Snapshot captures the world as the agent saw it at decision time: which data versions were active, which policy definitions were in force, which catalog references were resolved. This layer pulls from a prerequisites datastore - Redis for low-latency state, S3 for snapshot durability. It’s also the layer that makes post-incident analysis possible. When you need to understand why the agent concluded what it did three months ago, you need to know what it knew at that moment - not what the system knows now.
This layer is, in practical terms, where a context graph lives - even if most implementations don't call it that yet. A context graph is simply the structured representation of what the agent knew and how those things related to each other at decision time: data assets, policies, catalog nodes, versions, and their connections. The reason "context graph" is gaining traction as a term without a settled definition is precisely because this layer has been missing from most agent architectures. Once you build the snapshot layer properly, you have one.
Reasoning Chain and Decision Engine is the core trace layer. Chain-of-thought steps, intermediate logic, intermediate outputs - all captured as structured records. It is the path the agent took to reach the final answer. Every branch, every intermediate conclusion, every tool invocation that shaped the reasoning is an addressable record here.
Policy Binding Service records the guardrails, rules, and decision logic that were active during the reasoning process. This is what separates a decision trace from a debugging log. You’re not just capturing what the agent did, you’re capturing the constraints it was operating under. When a compliance team asks “was the agent following the policy that was in force on this date,” this layer answers that question directly.
Outcome and Action Capturing records the final response, the action taken, and critically - any redress or complaint data attached to that outcome. This closes the loop between the agent’s decision and its real-world consequence. It’s also the layer that feeds dispute resolution workflows when customers or regulators challenge an outcome.
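The five layers above could be carried in a single record shaped roughly like this. It is a sketch, not a schema: every field name and value is an assumption made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DecisionTrace:
    request: Dict[str, Any]                # 1. input capture: who asked what, when
    context_snapshot: Dict[str, str]       # 2. data and policy versions seen at decision time
    reasoning_steps: List[Dict[str, Any]]  # 3. intermediate steps and tool calls
    active_policies: Dict[str, str]        # 4. guardrails and rules in force
    outcome: Dict[str, Any]                # 5. final action plus any redress data


trace = DecisionTrace(
    request={"user": "u-123", "query": "Can I get a refund?", "received_at": "2026-04-18T10:00:00Z"},
    context_snapshot={"orders_dataset": "v41", "refund_policy": "v3"},
    reasoning_steps=[
        {"step": 1, "kind": "tool_call", "detail": "looked up order"},
        {"step": 2, "kind": "conclusion", "detail": "order is within refund window"},
    ],
    active_policies={"refund_policy": "v3"},
    outcome={"action": "approve_refund", "confidence": 0.92, "complaint": None},
)

record = asdict(trace)  # plain dict, ready to serialise and write to a trace store
print(list(record))
```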

All five layers feed into an Immutable Audit Record - timestamped, hashed, and written to an immutable trace store (S3, Delta Lake or a ledger database). The immutability is a must-have. It is the architectural guarantee that the record cannot be altered after the fact, which is what makes it defensible in a regulatory or legal context. The diagram specifies a retention period that aligns with financial services conduct requirements and is a reasonable baseline for any regulated environment.
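The "timestamped and hashed" part can be sketched as a simple hash chain using only the Python standard library. This is one common technique, not necessarily what the diagram intends, and the record layout is an assumption. Hashing makes tampering detectable; actual immutability still has to come from the store itself.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List

GENESIS = "0" * 64  # assumed starting value for the first record in the chain


def seal_record(trace: dict, previous_hash: str) -> dict:
    """Timestamp a trace and hash it together with the previous record's hash."""
    body = {"sealed_at": datetime.now(timezone.utc).isoformat(), "trace": trace}
    payload = json.dumps(body, sort_keys=True)
    digest = hashlib.sha256((previous_hash + payload).encode("utf-8")).hexdigest()
    return {"previous_hash": previous_hash, "payload": payload, "hash": digest}


def verify_chain(records: List[dict]) -> bool:
    """Recompute every hash in order; any edited payload breaks the chain."""
    previous = GENESIS
    for record in records:
        expected = hashlib.sha256((previous + record["payload"]).encode("utf-8")).hexdigest()
        if record["previous_hash"] != previous or record["hash"] != expected:
            return False
        previous = record["hash"]
    return True


r1 = seal_record({"id": "t1", "action": "approve"}, GENESIS)
r2 = seal_record({"id": "t2", "action": "deny"}, r1["hash"])
tampered = dict(r1, payload=r1["payload"].replace("approve", "deny"))

print(verify_chain([r1, r2]))
print(verify_chain([tampered, r2]))
```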

From the trace store, a Trace Query Service and Data Lake make the records queryable at scale. This is the operational distinction you need to understand. A queryable trace lets you ask: “Show me every decision where the policy binding service applied rule X and the outcome was Y.” That’s the difference between evidence and insight.
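That kind of question maps naturally onto a query. A sketch using an in-memory SQLite database from the standard library; the table name, columns and sample rows are invented, and a real trace store would be far larger and differently shaped:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace_rules (trace_id TEXT, rule TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO trace_rules VALUES (?, ?, ?)",
    [
        ("t1", "rule_x", "denied"),
        ("t2", "rule_x", "approved"),
        ("t3", "rule_y", "denied"),
        ("t4", "rule_x", "denied"),
    ],
)

# "Show me every decision where rule X was applied and the outcome was Y."
rows = conn.execute(
    "SELECT trace_id FROM trace_rules WHERE rule = ? AND outcome = ? ORDER BY trace_id",
    ("rule_x", "denied"),
).fetchall()
matching = [row[0] for row in rows]
print(matching)
```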

The four downstream outputs from this architecture tell you exactly what it’s designed to serve: Redress and Dispute Resolution (when a decision is challenged), Audit Trail Reporting (when a regulator asks), Debugging and Root Cause Analysis (when something fails), and Improvement and ML Training (when you want to make the system better using real decision data).

No single layer here is novel in isolation. Input capture, immutable storage, policy versioning exist in adjacent systems already. What doesn’t exist yet, in any standardised form for AI agents, is this stack assembled as a coherent, purpose-built trace infrastructure. That’s the gap this architecture closes.

Why enterprise architects need to move on this now

This infrastructure is necessary to build the cleanest path through regulatory scrutiny, incident response, and enterprise customer due diligence. The same principle made structured engineering logging standard practice in distributed systems. You cannot debug what you cannot observe. You cannot improve what you cannot measure. And you cannot defend in a board meeting, a regulatory inquiry, or a customer audit what you never recorded.

Decision traces are the observability layer for AI reasoning. They are the infrastructure equivalent of distributed tracing in microservices, now applied to systems that don’t just execute code, but form conclusions and take actions.

The good news is that this is buildable now, with current tooling, before the incidents force it. The question is whether engineering organisations treat it as foundational infrastructure from the first production deployment, or discover its absence after the fact.

Call To Action

If you are building AI agents for anything consequential, the time to design trace infrastructure is before the first production deployment. Start by mapping which decisions your agent makes that you could not currently explain, audit, or defend: that list is your build priority.

If this framing is useful, share it with the architect or engineering lead on your AI team - this is the conversation that needs to happen before the system goes live, not after.

Leave feedback or a comment. Share your opinion about this topic.

Thanks,
Sandi.

👉 You might also find my article published on Atlan’s community Substack useful:

Context & Chaos

Context Graphs as AI Evaluation Infrastructure



I was offline. Here's what happened when I came back.

Sat, 11 Apr 2026 13:31:27 GMT

Hello everyone,

I owe you an explanation for going quiet last Saturday.

We took an Easter break as a family - properly offline, no laptop, in the English countryside. It was superb: the rolling green fields actually delivered on the promise, and it was sunny ☀️

Right. I’m back. And there’s quite a lot to catch you up on.

I was at an AI engineering conference this week

And I gave a talk; it will be out soon. This was the first time the conference was held in Europe, and I got to meet so many smart, talented founders and engineers. It was an awesome experience. In-person events are irreplaceable.

I attended some fabulous sessions on cutting-edge AI work. And of course OpenClaw dominated the discussion.

Check out the conference here: https://www.ai.engineer/europe

I also made it to the online track of the conference

And the topic was something I’ve been working towards for a while - Multi-Agent Orchestration Patterns for Production.

The core argument: the field is moving fast, but most teams hit the same wall. They build multi-agent systems like they built single-agent systems. Same assumptions, same trust in the “it works in the demo” signal. And then production arrives, and nothing holds.

The talk walked through choreography vs orchestration, immutable state patterns, circuit breakers, and why distributed systems thinking is no longer optional if you’re building agents at any meaningful scale.

Check it out.

A piece I wrote just went live

This one has been in the works for a while, and I’m genuinely proud of it.

I wrote a guest article for Context & Chaos introducing two concepts I’ve been developing from my work with regulated enterprises: context drift and the evaluation graph.

When an AI system gives you an answer, that answer wasn’t produced in a vacuum. It was produced against a specific version of your world - a specific definition of what “active customer” meant that week, a specific policy that was in force that month, a specific dataset that may or may not still exist.

Think of it like this: imagine a doctor’s notes. It’s not enough to record what prescription they wrote. You also need to know what guidelines were current that day, what the patient’s history showed at that point, what the lab results said. Without that context, the notes are incomplete. Enterprise AI has the same problem. We’re recording the prescription. We’re not recording everything else that informed it.

This is original IP, and I think it’s going to become a recurring theme in how regulated industries think about AI governance.

Here is the article:

Context & Chaos

Context Graphs as AI Evaluation Infrastructure


New YouTube video: The AI Latency Stack

While you’re in content-consumption mode this weekend, I also want to point you to a video I put out recently on AI application latency - it keeps coming up in conversations and I wanted to have something concrete to point people to.

The short version: after your AI system ships to production, the model is almost never the problem. It’s a set of architectural decisions - streaming, database writes on the critical path, cold starts, context window bloat, prompt caching, sequential calls that should be parallel - that compound into something that makes users give up and go back to the manual process. The video walks through seven of these layers and how to address them, without swapping models or changing vendors.
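To make one of those layers concrete, here is a small sketch of sequential calls that could be parallel, using `asyncio` from the standard library. The three calls and their 0.1 second delays are stand-ins for independent lookups, not measurements from any real system:

```python
import asyncio
import time


async def fake_call(name: str, delay: float) -> str:
    """Stand-in for an independent I/O-bound call (retrieval, profile lookup, ...)."""
    await asyncio.sleep(delay)
    return name


async def sequential() -> list:
    # Each call waits for the previous one, so the delays add up.
    return [
        await fake_call("retrieval", 0.1),
        await fake_call("profile", 0.1),
        await fake_call("policy", 0.1),
    ]


async def parallel() -> list:
    # Independent calls run concurrently, so total time is roughly the slowest one.
    return list(await asyncio.gather(
        fake_call("retrieval", 0.1),
        fake_call("profile", 0.1),
        fake_call("policy", 0.1),
    ))


def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


seq_result, seq_time = timed(sequential)
par_result, par_time = timed(parallel)
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both versions return the same results in the same order; only the waiting changes, and only when the calls truly do not depend on each other.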

Worth a watch if you’re anywhere near a production AI deployment right now.

That’s it for this week.

A lot happened in a short space of time, and I wanted to share it with you directly. As always - reply if anything resonates, or if you’re wrestling with something I touched on.

Talk soon,
Sandi


The High Agency Engineer Will Win the AI Era. Here's What I'm Seeing in the Field.

Sat, 28 Mar 2026 14:31:07 GMT

I’m seeing something in the field right now that is genuinely opening my eyes.

I’m lucky. My job puts me in front of a lot of engineering teams across a lot of organisations. Some are moving fast. Some are moving slow. And I get to see both. Not from a distance, up close, in the actual conversations where decisions get made.

What I’m watching is a quiet split happening inside engineering teams. And I think it matters for anyone thinking about where this profession is heading.

Two engineers. Same company. Same tools available. Same access to AI. Completely different outcomes.

One of them, when they hit a hard problem, opens a chat window and starts working through it out loud. They dump in the messy context. The half-baked question. The data that doesn’t quite make sense yet. They’re not looking for autocomplete. They’re looking for a way through.

The other one says “AI isn’t reliable enough for this.” And goes back to doing it the slow way.

I’ve watched this play out across banks, fintechs, and large regulated enterprises. And the gap between these two engineers is only getting wider.

What high agency actually looks like

I was working with a team recently trying to make sense of a large pile of unstructured documents. Audit logs, policy docs, historical reports. The kind of work that normally takes weeks of someone’s time.

One engineer on the team didn’t wait to be told how. She had no prior experience with the specific tooling. But she sat down, broke the problem into pieces, and used AI to work through each one. By end of day she had something working. Not perfect. But working.

She didn’t have a playbook. She made one.

That’s what high agency looks like in practice. Not waiting for a process document. Not waiting for someone to say it’s approved. When these engineers hit a wall, their first instinct is to figure out what question to ask - not to explain why the wall is there.

What the resistance sounds like

I want to be careful here. The engineers pushing back on AI are not lazy. Many of them are the most experienced people in the room.

But the resistance has a pattern.

“It hallucinates too much for our use case.”

“Security hasn’t signed it off yet.”

“The outputs aren’t consistent enough to trust.”

“This is hype, let it settle.”

Some of these are valid. I work in regulated environments. I understand the constraints.

But what I notice is this. The engineers saying these things have usually not given AI their hardest problem. They’ve given it easy tasks, watched it stumble, and concluded it isn’t ready. They’re evaluating a tool they haven’t really pushed.

The high agency engineers hit the same limitations. They just treat them as constraints to work around, not reasons to stop.

There’s something underneath the resistance

I think it goes deeper than technology skepticism.

A lot of experienced engineers have built their identity around already knowing the answer. They’re the person people come to. The one who’s seen this before.

AI is uncomfortable for that identity. Because the value is shifting. It’s moving away from already knowing - toward knowing how to ask. That’s a different skill. And it asks you to be a beginner again, at least partially.

The engineers I see thriving have a looser grip on what they already know. They’re curious before they’re skeptical.

They pick the tool up before they critique it.

One practical thing

Ask yourself honestly: when did you last give AI your genuinely hardest problem?

Not “summarise this document” or “tidy up this function.”

The real hard thing. The one you’ve been circling because you don’t quite know where to start.

I’ve seen engineers use AI to compress weeks of analysis into a day. I’ve seen it catch patterns in production failures that a team had been chasing for months. I’ve seen it unlock a business conversation that had been stuck for a quarter - just by helping someone structure their thinking clearly enough to explain it.

None of that happened because the technology was perfect. It happened because someone decided to figure it out.

The job of an engineer is changing. I’m watching it happen. The ones adapting aren’t the most experienced or the most technical. They’re the ones most willing to stay curious.

That’s the only practical advice I have.

One question before you go - what's the most interesting thing you've seen an engineer do with AI that nobody is talking about yet?

Hit reply and tell me.

Talk soon,
Sandi

Ask your friends to join.
More valuable content coming your way.

Share agentbuild.ai

Thanks for reading agentbuild.ai! Subscribe for free to receive new posts and support my work.

NVIDIA GTC 2026: From GPUs to AI Factories - What Vera Rubin Really Means for Builders

Sat, 21 Mar 2026 14:30:43 GMT

Hello everyone,

Finally, spring is here, with a few sunny days in England (I don’t want to jinx it, though). Overall I am feeling happy and trying to get back to my running habit. The one disappointment I carried last week was not being able to attend NVIDIA GTC - I have too much going on to make a trip to the US right now.

Anyway, I have been following all the updates. I have collected the top things you should know in this newsletter.

Let’s have a look.

Folks, GTC 2026 was the week NVIDIA stopped selling us GPUs and started selling us AI factories - hardware, agents, and even token budgets included. Lovely stuff.

For years, GTC keynotes have been about bigger chips, more FLOPs, and eye‑watering benchmarks. This year was different. Mr. Jensen Huang’s message was clear: the center of gravity is moving from individual accelerators to full‑stack “AI factories” that ingest data on one end and ship intelligence on the other.

1. At the heart of that story is Vera Rubin.

Vera Rubin is an integrated platform: seven specialized chips, multiple rack‑scale systems, a supercomputer, orchestration software, and a roadmap to the next platform, Feynman. If Blackwell was the engine, Rubin is the entire plant. You don’t just get more TFLOPs; you get an opinionated way to build and run agentic systems at scale.

That framing matters if you’re an AI, ML, or data engineer.

Instead of asking “How do I get access to H100s or B100s?”, the real question becomes “Where will my AI factory live, and what will it produce?” That’s a very different conversation about architecture, data, and economics.

2. The trillion‑dollar AI factory build‑out

Mr. Huang also did something subtle but important: he didn’t talk about AI as a feature; he talked about AI as infrastructure. The combined order pipeline he referenced for Blackwell and Vera Rubin runs into the trillion‑dollar range over the next few years. Whether you believe in the exact number or not, the signal is unmistakable.

We’re no longer in the “let’s try a model” phase. We’re in a multi‑year build‑out of AI plants in the same way we once built data centers, clouds, and mobile networks. That means:

Inference economics become a first‑class design constraint.
Token budgets will be as real as laptop or SaaS budgets.
Capacity planning for AI will look more like power and networking planning than like a one‑off POC.

If you’re building products, this is your wake‑up call to treat AI like infrastructure, not a sprinkle of magic dust at the end of a roadmap.

3. The “agentic moment” is now official

Another clear shift: NVIDIA is leaning hard into agentic systems.

NemoClaw and its surrounding tooling were positioned as core to how enterprises will build with these new platforms. The pattern is no longer “one giant model behind an API.”

It’s:

Tool‑using agents orchestrating calls into models and services.
Multi‑step workflows that reason, plan, and act.
Customization and fine‑tuning on your own data, running on your own slice of an AI factory.

Practically, that means agent orchestration, evaluation, and safety move from hacker‑weekend topics to board‑level concerns. It also means AI and data teams who understand tools, context, and control flows will be disproportionately valuable.

4. Hardware envy and the pace problem

There’s a less comfortable undercurrent to all of this: hardware obsolescence.

If you invested heavily in last year’s “AI factory,” GTC 2026 probably gave you a twinge of regret. Rubin‑class systems move the goalposts again. Throughput, efficiency, network architecture - everything just jumped.

Most teams won’t be able to rip and replace every cycle. So the question becomes: how do you architect for optionality?

(By the way, this applies to any production-grade AI system architecture)

A few practical edges:

Design around portable abstractions (containers, standard runtimes, open protocols), not vendor‑specific stuff.
Separate concerns: data platform, model platform, agent layer. You want the freedom to swap pieces as the hardware evolves.
Focus on investments that survive GPU generations: data quality, evaluation, governance, and product integration.

The platforms will keep getting better.

Your moat will be how quickly you can adapt your stack to whatever comes next.

5. Data and the physical world reclaim the spotlight

One of my favorite subplots from this GTC is that structured data, simulation, and physical AI quietly stepped into the spotlight.

DLSS 5 and the new wave of neural rendering aren’t just about prettier video games. They’re about real‑time, photorealistic, physics‑aware environments you can use to train and validate agents. Combine that with better edge hardware and you get a serious push toward robots, industrial agents, and AI systems that interact with the messy real world.


For data people, the implication is simple: tables, events, and logs are still the fuel. For AI engineers, simulations and digital twins are becoming as important as datasets. For product teams, the bar for “realistic” behavior in AI‑powered experiences just went up.

Why this GTC matters for you

If you strip away the marketing, GTC 2026 is telling builders three things:

The unit of competition is shifting from model to factory.
Agentic systems will be the default pattern for serious AI products.
The compounding advantage still comes from data, evaluation, and integration - not just chips.

If you’re in AI, ML, or data, your edge will come from how fast you can align your architecture, practices, and skills with that reality.

What you can actually do next (without a Rubin cluster)

Most of us are not spinning up NVIDIA Vera Rubin systems next quarter. The realistic move is to upgrade how you think, learn, and design.

Here are four learning objectives you can pursue right now:

Think in “AI factories,” not just models
Map your current stack - data collection, feature engineering, model training, deployment, monitoring - against the AI factory idea. Where is data still manual? Where is evaluation an afterthought? Where are agents bolted on instead of designed in from the start?
Get comfortable with inference and token economics
Even if you’re using cheap or free models, start tracking tokens‑per‑feature and cost‑per‑request. A simple spreadsheet or dashboard that shows “this feature costs X per 1,000 users” will change how you design prompts, choose models, and argue for optimizations.
Practice building small, robust agentic flows
Use open‑source frameworks or your favorite LLM stack to wire up basic agents: retrieval + tool calling + simple planning. Focus less on exotic models and more on reliability, evaluation, and clear boundaries for what the agent should and shouldn’t do.
Re‑center your work on data, evaluation, and simulation
Treat your tables, logs, and events as the core asset, not an afterthought. Experiment with offline evaluation harnesses. If your domain touches the physical world, explore simple simulation or synthetic scenarios - even if you’re not using Omniverse‑grade tools yet.
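
To make the token-economics objective concrete, here is a minimal sketch of the calculation such a spreadsheet would hold. The feature name, token counts, request volume, and prices are made-up placeholders, not real vendor pricing; swap in your own numbers.

```python
from dataclasses import dataclass


@dataclass
class FeatureUsage:
    """Average usage of one AI-powered feature (all numbers are hypothetical)."""
    name: str
    input_tokens_per_request: int
    output_tokens_per_request: int
    requests_per_user_per_month: float


def cost_per_request(usage: FeatureUsage,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Cost of one request, in the same currency as the per-million-token prices."""
    token_cost = (usage.input_tokens_per_request * input_price_per_million
                  + usage.output_tokens_per_request * output_price_per_million)
    return token_cost / 1_000_000


def cost_per_1000_users(usage: FeatureUsage,
                        input_price_per_million: float,
                        output_price_per_million: float) -> float:
    """Monthly cost of the feature for 1,000 users."""
    per_request = cost_per_request(usage, input_price_per_million,
                                   output_price_per_million)
    return per_request * usage.requests_per_user_per_month * 1000


# Placeholder numbers for illustration only.
summariser = FeatureUsage("document summariser", 2000, 500, 20)
monthly = cost_per_1000_users(summariser,
                              input_price_per_million=1.0,
                              output_price_per_million=4.0)
print(f"{summariser.name}: {monthly:.2f} per 1,000 users per month")
```

Even a rough version of this, kept per feature, is usually enough to start the conversation about which prompts are worth shortening and which features deserve a cheaper model.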

This is too long already folks. I will stop here.

Thank you for reading, and please leave your comments and feedback. Get in touch and tell me what you are learning and what you would like to know more about.


How Do You Test AI - Practical Talk on AI Evaluation Approaches

Sat, 14 Mar 2026 14:31:07 GMT

Hello everyone,

Last week, I sat down with Hamel Husain for the AgentBuild podcast. Hamel is one of the most influential voices on AI evaluation in the industry right now - the kind of person that other experts quote when they’re trying to explain something difficult. The conversation was one of the most practically useful I’ve had this year, and I want to share the best of it with you.

Who is Hamel Husain?

Hamel is a machine learning engineer with over 20 years of experience. He’s worked at Airbnb and GitHub - where his early LLM research contributed to what eventually became GitHub Copilot. He has led and contributed to popular open-source ML tools, and today he’s an independent consultant who has helped more than 35 organisations build real-world AI products that actually perform in production.

He co-teaches AI Evals for Engineers and PMs on Maven, a course with over 3,000 students from 500+ companies - including teams at OpenAI, Anthropic, and Google. He also writes one of the most substantive technical blogs in the AI space at hamel.dev, and is co-authoring an O’Reilly book on the subject - Evals for AI Engineers.

He is, in short, the person you call when your AI system’s quality is a mystery.

The biggest misconception about AI evaluation

I asked Hamel what the most common mistake is when enterprises approach evaluation. He didn’t hesitate.

“The biggest misconception is that evaluation is as easy as going to a vendor that will give you off-the-shelf metrics in a dashboard. You plug it in, and poof - you’ve done eval. You’ve checked the box.”

He’s seen this play out more times than he can count. The team wires up a platform. They get a dashboard with coherence scores, faithfulness scores, toxicity scores. Everyone feels good for the first week. And then, slowly, people start to realise: those numbers don’t mean anything. No one knows what they’re measuring. They can’t tell if the product is getting better or worse.

Generic metrics don’t correlate to what matters for your specific application. A hallucination score doesn’t tell you if your legal AI is giving dangerous advice. A coherence score doesn’t tell you if your scheduling assistant is actually booking the right slots.

As Hamel puts it: “When I see a company showing me only generic metrics, I already know they’re in trouble. There’s a direct correlation between generic dashboards and teams that feel lost.”

Foundation model evals vs. product evals

There’s a second source of confusion worth naming. When people hear the word ‘evaluation’, many think of the benchmarks that model providers publish - things like MMLU, SWE-Bench, HumanEval. These are foundation model evals. They measure the general capabilities of a model at large. They have almost nothing to do with how well your AI product performs for its specific purpose.

Hamel’s analogy is the one I’ll keep using: Foundation model evals are like a standardised test score. Product evals are like job performance. Your SAT score tells you very little about whether you’ll be a good engineer. The gap between the two can be enormous.

What matters for the enterprise is the second kind. Not ‘how capable is GPT-5 generally?’ but ‘is our claims-processing assistant handling edge cases correctly, and can we measure that consistently?’

Why traditional QA isn’t enough

One of the most common objections I hear from customers: “We already have a QA team. Why can’t they just test the AI?”

Hamel’s answer is direct: treating AI evaluation like traditional software testing is a fundamental mistake.

The reason is determinism. Traditional software has deterministic outputs. You write a unit test: given input X, expect output Y. Pass or fail. With AI, the output is stochastic by design. The system is explicitly built to produce varied responses. You can’t write a unit test for a stochastic system the same way.

What you need instead is something that already exists - but that most AI teams have left behind: data science thinking.

Data scientists have been measuring stochastic systems for decades. They know how to sample, analyse, spot patterns, and design experiments that account for variability. That entire discipline is exactly what’s needed for AI evaluation. We just need to adapt it slightly for LLMs.

Hamel summarises it simply: “Evals are essentially data science for AI.”

The eval loop - how it actually works

Before I go further, let me show you the structure of a proper evaluation loop. This is what we explored together on the podcast earlier this week:

The loop is simple in principle. You feed test inputs into your model. You compare what it produces to what you wanted. You score the gap. And you use that signal to decide what to change - the prompt, the model, the data, or the test cases themselves.
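
As a rough illustration of that loop, here is a minimal sketch in Python. The `toy_model` function is a stand-in I made up so the example runs without any API; in a real setup it would be a call into your own system. The evaluator is the simplest one possible, a normalised exact match.

```python
from typing import Callable


def exact_evaluator(output: str, expected: str) -> float:
    """Simplest possible evaluator: 1.0 on a normalised exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(cases: list[dict],
             model: Callable[[str], str],
             evaluator: Callable[[str, str], float]) -> dict:
    """Feed each test input to the model and score the output against the target."""
    results = []
    for case in cases:
        output = model(case["input"])
        score = evaluator(output, case["expected"])
        results.append({**case, "output": output, "score": score})
    mean = sum(r["score"] for r in results) / len(results) if results else 0.0
    # The failures are the signal: they tell you what to change next.
    failures = [r for r in results if r["score"] < 1.0]
    return {"mean_score": mean, "failures": failures}


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    canned = {"Capital of France?": "Paris", "2 + 2?": "5"}
    return canned.get(prompt, "")


cases = [
    {"input": "Capital of France?", "expected": "paris"},
    {"input": "2 + 2?", "expected": "4"},
]
report = run_eval(cases, toy_model, exact_evaluator)
print("mean score:", report["mean_score"])
for failure in report["failures"]:
    print("FAILED:", failure["input"], "->", failure["output"])
```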

The hard part is what lives inside the “evaluator” box. There are four types of evaluators, each with different trade-offs:

Exact match - Deterministic scoring. The output must match the expected answer. Fast and cheap, but brittle. Works well for classification, SQL, and structured outputs. Falls apart when the answer is right but worded differently.
Heuristic - Rule-based checks. Regex patterns, keyword presence, schema validation, length constraints. Good for catching structural failures. Can’t evaluate meaning.
Human review - Real people read outputs and rate them. The highest nuance. Also the slowest and most expensive. Essential for calibrating everything else, but doesn’t scale to thousands of daily outputs.
LLM-as-judge - A second AI model evaluates the output against a rubric. Scales well, handles open-ended responses, captures nuance that heuristics miss. But it inherits the judge model’s biases and blind spots. Requires calibration against human labels.

In practice, mature teams use all four. Exact match and heuristics form the fast, cheap baseline. LLM-as-judge handles scale on open-ended outputs. Human review calibrates the judge periodically.
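
One way to picture that layering is the sketch below. It is only an illustration under my own assumptions: the output is expected to be a JSON object with an `answer` field, the 500-character limit is arbitrary, and `judge` is any callable you supply (a wrapper around your judge model), not a real library API.

```python
import json
from typing import Callable, Optional


def exact_match(output: str, expected: str) -> bool:
    """Deterministic check: fast and cheap, but brittle to rewording."""
    return output.strip().lower() == expected.strip().lower()


def heuristic_problems(raw_output: str) -> list[str]:
    """Rule-based structural checks. Returns a list of problems (empty = OK)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    if "answer" not in data:
        return ["missing 'answer' field"]
    if len(str(data["answer"])) > 500:  # arbitrary length limit for the sketch
        return ["answer too long"]
    return []


def evaluate(raw_output: str,
             expected: Optional[str] = None,
             judge: Optional[Callable[[str], bool]] = None) -> dict:
    """Run the cheap evaluators first and escalate only when they cannot decide."""
    problems = heuristic_problems(raw_output)
    if problems:
        return {"passed": False, "by": "heuristic", "detail": problems}
    answer = str(json.loads(raw_output)["answer"])
    if expected is not None and exact_match(answer, expected):
        return {"passed": True, "by": "exact_match", "detail": []}
    if judge is not None:
        # The answer may be right but worded differently: ask the judge.
        return {"passed": judge(raw_output), "by": "llm_judge", "detail": []}
    return {"passed": False, "by": "needs_human_review", "detail": []}


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Calibration: fraction of items where the judge agrees with the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


print(evaluate('{"answer": "Paris"}', expected="paris"))
print(evaluate("Paris, obviously"))
print(judge_agreement([True, False, True, True], [True, True, True, True]))
```

The agreement number at the end is the part teams tend to skip. If the judge disagrees with your human labels often, its scores are not telling you much yet.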

The most powerful habit in AI development

Here’s where Hamel said something that sounds counterintuitive - and that I think is the single most important insight from our conversation.

The highest-value activity you can do when building an AI product is to sit down and look at your data.

Just open your traces, read actual outputs, and write down what you see.

When most people hear this, something in them resists. “In the age of AI, you’re telling me to open a spreadsheet and read individual data points? That can’t scale.”

But Hamel has done this with more than 50 companies. Every single time, people discover it’s not just useful - it’s transformative. They find unexpected failure modes. They identify bugs they didn’t know existed. They develop intuition that no automated system would have surfaced.

And there’s a second reason it matters: looking at data is how you elicit your own requirements. This is what Hamel calls criteria drift. You can write a specification about what a good product looks like. But it’s only when you see real user interactions that you understand what “good” actually means for your context. The process of reading real outputs and writing down what you observe is the process of transferring your taste to the system.

His practical starting point: aim to read at least 100 traces. 100 is not a magic number, it’s a concrete goal that gets people started. Keep reading until you reach what he calls theoretical saturation - the point where new traces aren’t revealing new failure patterns. In his experience, people rarely want to stop.
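
If you want a rough way to notice saturation while you read, the sketch below tracks the failure patterns you write down per trace and reports when a run of consecutive traces adds nothing new. The `window` size is my own assumption, not something Hamel prescribed, and the tags are invented examples.

```python
from typing import Optional


def saturation_point(trace_notes: list[set[str]], window: int = 20) -> Optional[int]:
    """Return how many traces had been read when `window` consecutive traces
    revealed no new failure pattern, or None if that never happened."""
    seen: set[str] = set()
    traces_since_new = 0
    for count, tags in enumerate(trace_notes, start=1):
        new_tags = tags - seen
        if new_tags:
            seen |= new_tags
            traces_since_new = 0
        else:
            traces_since_new += 1
            if traces_since_new >= window:
                return count
    return None


# Each set holds the failure patterns noted while reading one trace.
notes = [
    {"wrong_date_format"},
    set(),
    {"ignored_user_constraint"},
    set(),
    set(),
    set(),
]
print(saturation_point(notes, window=3))
```

The point is not to automate the reading. It is only a counter that tells you when you have stopped learning new things from it.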

Who should own evaluation - and how to champion it

The other question I pressed Hamel on was organisational. In a large enterprise with 50 AI use cases in the pipeline, who owns this? Is there a central eval function? Does it sit with a team or a role?

He’s a strong advocate for bottom-up adoption, not centralised mandates. Top-down approaches - “our platform will standardise eval across the organisation” - tend to produce checkbox compliance. Teams grudgingly report metrics nobody understands.

What actually works: start needs-based. One team, one product, one clear problem they’re trying to debug. Embed the evaluation practice into the building process - not as a separate audit step, but as part of how they iterate.

Evaluation must be owned by domain experts, not outsourced to a QA or engineering team. If it’s a legal assistant, a lawyer needs to be in the loop. If it’s a clinical tool, a clinician. The domain expert is the only person with the taste and judgment to say whether an output is genuinely good. The engineering team can build the infrastructure, but the ground truth comes from the domain.

As for championing it internally? Hamel’s advice is the same I’d give for any new practice: don’t sell the methodology. Sell the results.

Don’t walk into a meeting saying “we need to do evals.” Walk in with findings. Show the error rate you found. Show the specific failure pattern you fixed. Show how you caught a regression before it went to production. Once people see that you consistently know more about what the product is actually doing than anyone else in the room, they’ll ask how.

What this means for your enterprise AI programme

If I were to distil everything Hamel shared into the things you can actually act on this week, here’s how I’d frame it:

Resist the pull of generic dashboards. If your only eval metrics are off-the-shelf coherence and faithfulness scores, you’re not evaluating - you’re performing evaluation. The metrics that matter are the ones you derive from looking at your own system’s failures.
Spend 30 minutes this week reading traces from one of your AI systems. Don’t automate it yet. Just read. Write down what you notice. You’ll find things no algorithm would have flagged.
Identify your domain expert. For every AI use case you’re building, there should be a person who has the authority and the proximity to say whether an output is good. That person needs to be in the loop on evaluation, not just the engineering team.
When you want to bring others along, don’t present a methodology deck. Present results. Show the before and after. Lead with what changed for the user or the business, and let the process speak for itself.

If you haven’t listened to my conversation with Hamel yet, I’d encourage you to. It’s one of the best discussions I’ve had on what it actually takes to build AI systems that hold up in production - not just in the demo.

I’d love to know what resonated. Reply to this email or comment here - looking forward.

Talk soon,
Sandi


Big Companies Are About to Test Whether Their Employees Can Think Without AI. Here’s Why That Should Matter to You.

Sun, 08 Mar 2026 13:15:26 GMT

Happy International Women’s Day.

The future of AI is going to be shaped by the people who build it, question it, and decide how it gets used. Right now, women in tech are doing all three - often with less recognition than they deserve.

The field needs more of their voices, not fewer. If you know a woman in tech who is doing great work, today is a good day to tell her.

Now - to this week’s newsletter.

Gartner recently predicted something that made me think hard.

By 2026, they say, roughly half of large organisations will introduce what they’re calling “AI-free skills assessments.”

In plain English: companies are going to start formally testing whether their employees can still think, write, and solve problems without any AI help at all.

Not instead of AI skills. In addition to them.

When I first read that, I thought it was a bit extreme. But the more I thought about it, and the more I thought about what I see inside big organisations every day, the more I think they’re onto something real.

Last month I was in a meeting with a team trying to solve a difficult problem.

Someone suggested asking ChatGPT. So they did. The AI gave them a confident, well-structured answer. Everyone nodded and moved on.

I asked one of them afterwards: “Do you actually think that was the right answer?”

She paused. “Honestly? I don’t know. It sounded right.”

That’s the problem Gartner is trying to name. Not that AI gives bad answers. But that we’re losing the ability to tell whether the answer is good - because we’ve stopped forming our own opinion first.

It’s like becoming so reliant on GPS that you no longer have any sense of direction yourself. Fine, until the signal drops.

Here’s why this matters specifically if you’re learning AI right now.

You’re entering a world where everyone will have access to the same AI tools. The tools are getting cheaper and easier every month.

In two years, using AI competently won’t be a skill that sets you apart - it’ll just be the baseline.

What will set you apart is the judgment to know when the AI is wrong. The ability to ask a better question. The confidence to push back on an answer that sounds plausible but isn’t quite right.

Those things only come from practising thinking for yourself. And that’s the muscle that quietly atrophies when every task starts with “let me ask AI first.”

The people who will get the most out of AI are the ones who bring their own thinking to it - not the ones who outsource their thinking to it.

One small habit worth building now:

Before you open any AI tool for a problem, spend five minutes writing down what you actually think. Not a perfect answer. Just your honest first attempt - in your own words, your own logic.

Then bring in the AI. Compare. Push back where something feels off.

You’ll get dramatically better results from the tool. And you’ll keep the judgment sharp that makes those results mean something.

The companies Gartner is talking about will be testing for exactly that judgment. The good news is it’s not hard to build - it just has to be intentional.

I’m curious whether this resonates. Have you noticed yourself reaching for AI before you’ve really thought something through?

Hit reply - I read every response and it genuinely shapes what I write next.

Talk soon,
Sandi


The 7-Step Playbook for Turning Any Business Process Agentic

Mon, 02 Mar 2026 02:00:17 GMT

Something happens in almost every meeting I’m in these days.

Someone opens a slide, or just starts talking, and within two minutes we’re deep into a conversation about models. Vendors. Which LLM is better for this use case. Whether to go with one orchestration framework or another.

Features. Availability. Pricing tiers. Token limits.

And I sit there. Listening. Waiting.

Then I ask something like: “What does success look like for this?”

Or: “How will you know in six months if this worked?”

The room usually goes a bit quiet. Sometimes people look at each other. Sometimes someone gives a vague answer about “efficiency” or “reducing manual effort.” And then, almost without fail, the tools conversation resumes.

I’ve stopped being surprised by this. But I haven’t stopped being bothered by it. Because the tool conversation feels productive. It has energy. People have opinions. There’s something to debate. Meanwhile the question of what you’re actually trying to achieve - measured, specifically, in a way you could verify - just sits there unanswered.

And then teams wonder why their agentic systems don’t survive contact with production.

This is the playbook I wish more of those meetings started with.


First question: do you actually need multi-agent?

I ask this because nobody else does. The assumption in most rooms is that “agentic” means multiple agents.

Sometimes that’s right. If decisions in your process genuinely can’t coexist - different data access, different authority, different latency requirements - then splitting makes sense.

However, multi-agent systems are just distributed systems. And distributed systems are hard. When something breaks, you’re chasing a failure across boundaries, through handoffs, through tool calls you can’t always replay. That complexity doesn’t disappear.

Start single agent. Let the constraints of the actual process push you toward multi-agent if they need to.

Most of the time, the process doesn’t need it. The team just wanted to build it.

Where does the AI go, and where does the human stay?

The most common mistake I see: humans get put at the end. Final review. Rubber stamp before anything goes out. It feels like a safety net.

A human reviewing an output they didn’t generate, without the context that produced it, at the end of a chain they can’t fully see - that’s not oversight. That’s decoration.

AI belongs where errors are recoverable. Humans stay where they’re not.

A compliance violation, an irreversible action, something you’d find out about through an audit three weeks later - those stay human until the system has earned the right to handle them.

You don’t decide in a design session that the AI can handle something. You prove it. Slowly. With data.

The human doesn’t leave the loop because the architecture says so. They leave because the evaluation says it’s safe.

Does the whole process need to go agentic?

Almost certainly not. But the pressure to say yes is enormous right now.

The ROI case always gets built for the full process. End-to-end automation, scale infinitely. That’s how it gets approved. And then reality shows up - the data isn’t ready, the decisions aren’t defined clearly enough, and nobody can tell whether any of it is working.

What I’ve seen work: find the one or two decisions where human time is most expensive or delay is most painful. Start there. Leave the rest human for now.

The process you want to automate probably wasn’t that well-designed to begin with. AI will find every shortcut, every undocumented exception, every “we just know” that your team built into it over the years. Automating the whole thing at once means hitting all of that simultaneously.

Pick one decision. Define it properly. Prove it works. Then move.

The 7-step playbook

This is the Reverse Strategy Framework applied to process agentification.

The order matters.

Each step is a gate - if you can’t pass it, you’re not ready for the next one.

1: Map the decisions, not the steps.

Get the people who actually run the process in a room. Have them document the judgment calls. At every point where a human exercises discretion: what are they looking at, what makes them go one way versus another, what would a wrong call look like, and how quickly would you know?

Ask: if you gave two experienced people the same input, would they make the same call? If they regularly don’t - you don’t have a process you can automate.

You have a process you need to design first. You can’t build an agent to make a decision the organisation hasn’t agreed on.
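
One way to put a number on "would two experienced people make the same call" is an agreement statistic. The sketch below uses Cohen's kappa, which is my choice for illustration rather than part of the framework; it corrects raw agreement for the agreement you would expect by chance. The decisions are invented.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two people judging the same inputs."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("need two non-empty decision lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater chose labels at random at their own rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calls by two experienced reviewers on the same four cases.
reviewer_1 = ["approve", "approve", "reject", "reject"]
reviewer_2 = ["approve", "reject", "reject", "reject"]
print(cohens_kappa(reviewer_1, reviewer_2))
```

A low score here is the signal from this step: the humans have not agreed on the decision yet, so there is nothing stable for an agent to learn.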

2: Define what ‘good’ looks like for each decision.

For each decision node you’re considering, map out what precision you need, what an acceptable error rate is, and what the cost difference is between a false positive and a false negative. Quantify each with numbers.

These aren’t metrics you figure out after you build. They’re the thing that tells you whether the system is working at all. Most AI projects fail because nobody defined what capable meant. Do it here, before anything else.
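
Written down, a target for one decision node might look like the sketch below. The field names, thresholds, costs, and counts are placeholders I picked for illustration; the point is only that every one of them is a number agreed before the build.

```python
from dataclasses import dataclass


@dataclass
class DecisionTarget:
    """What 'good' means for one decision node (values are hypothetical)."""
    min_precision: float
    max_error_rate: float
    cost_false_positive: float  # cost of one wrong "yes"
    cost_false_negative: float  # cost of one wrong "no"


def assess(tp: int, fp: int, tn: int, fn: int, target: DecisionTarget) -> dict:
    """Compare observed confusion-matrix counts against the agreed target."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no decisions to assess")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    error_rate = (fp + fn) / total
    cost_per_decision = (fp * target.cost_false_positive
                         + fn * target.cost_false_negative) / total
    return {
        "precision": precision,
        "error_rate": error_rate,
        "cost_per_decision": cost_per_decision,
        "meets_target": (precision >= target.min_precision
                         and error_rate <= target.max_error_rate),
    }


# Example: a missed case (false negative) is 40x more costly than a false alarm.
target = DecisionTarget(min_precision=0.85, max_error_rate=0.05,
                        cost_false_positive=5.0, cost_false_negative=200.0)
result = assess(tp=90, fp=10, tn=880, fn=20, target=target)
print(result)
```

Notice that the example passes on precision and error rate while the false negatives dominate the cost. That asymmetry is exactly what this step is meant to surface.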

3: Check your data readiness, decision by decision.

Agents make inferences from data. For each decision node you want to automate, check whether the data exists, whether an agent can access it in real time, and whether it is structured in a way the agent can reliably use.

Most enterprise processes run on data designed for humans. PDFs. Exports. Systems that require a login and three clicks. Context that only exists because someone’s been in the role long enough to know where to look.

Check five things per node: how accessible the data is, whether the schema is clean and consistent, whether there’s enough metadata for the agent to interpret what it’s looking at, how errors and edge cases are handled, and whether there’s any observability into what the data’s doing.

If a node’s data isn’t ready, build the data layer first. A good model on broken data is still broken.
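
The five checks translate naturally into a small checklist per node. This is a minimal sketch; the field names are my own shorthand for the five checks above, and the node name is invented.

```python
from dataclasses import dataclass, fields


@dataclass
class DataReadiness:
    """The five per-node checks, each answered yes or no."""
    accessible: bool           # can the agent reach the data without a human?
    clean_schema: bool         # is the schema clean and consistent?
    sufficient_metadata: bool  # can the agent interpret what it is looking at?
    error_handling: bool       # are errors and edge cases handled?
    observable: bool           # is there observability into the data?


def readiness_gaps(checks: DataReadiness) -> list[str]:
    """Names of the checks that failed. An empty list means the node is ready."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


# Hypothetical node: the data exists, but only as PDFs behind a login.
invoice_approval = DataReadiness(
    accessible=False,
    clean_schema=False,
    sufficient_metadata=True,
    error_handling=True,
    observable=False,
)
gaps = readiness_gaps(invoice_approval)
print("build the data layer first:" if gaps else "ready:", gaps)
```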

4: Assign each decision - AI, Human, or Hybrid.

Using what you’ve defined in Steps 2 and 3: AI handles high-volume, well-defined decisions where errors are recoverable. Humans handle decisions where a wrong call is asymmetric and non-recoverable. Hybrid - AI proposes, human confirms - is for the middle ground, where you think it’s probably automatable but don’t yet have the data to prove it.

Write this down. It becomes the architecture contract. If someone later asks why the agent doesn’t handle a particular decision, the answer is already there.
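
The assignment rule is simple enough to write as a function, which also makes a handy form for the architecture contract. This is my reading of the rule above, not a definitive encoding; the four boolean inputs are assumed to come out of Steps 2 and 3.

```python
from enum import Enum


class Owner(Enum):
    AI = "ai"
    HUMAN = "human"
    HYBRID = "hybrid"  # AI proposes, human confirms


def assign_owner(errors_recoverable: bool,
                 well_defined: bool,
                 high_volume: bool,
                 evidence_meets_target: bool) -> Owner:
    """Assign one decision node to AI, Human, or Hybrid."""
    if not errors_recoverable:
        # A wrong call that cannot be undone stays with a human.
        return Owner.HUMAN
    if well_defined and high_volume and evidence_meets_target:
        return Owner.AI
    # Probably automatable, but not yet proven with data.
    return Owner.HYBRID


print(assign_owner(errors_recoverable=True, well_defined=True,
                   high_volume=True, evidence_meets_target=False))
```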

5: Decide single vs. multi-agent.

You now have the decision map. Look at it. Are there nodes that genuinely require incompatible contexts - different data access, different authority, reasoning chains that need to be isolated from each other? Those are your split points. If not, stay single.

I have seen many teams start with this conversation. It actually belongs here, in Step 5, with real information in front of you.

6: Build the evaluation before you build the agent.

I know this feels backwards. Build the thing first, then measure it - that’s the instinct. Don’t do it. Please.

Before you write a line of agent code, collect 200+ real examples from the process. Actual inputs, paired with what a good human would have decided on each one. Then define how you’ll score whether the agent’s call matches that standard.

This forces a useful confrontation: can you actually define “correct” before the AI has to? Sometimes you can’t. That’s valuable to discover now rather than six months into production.

Evaluation isn’t a final checkbox. It’s the architecture that keeps the whole thing alive.

If you can’t build a golden dataset for a decision node, that node isn’t ready. That’s not a failure. That’s the process working.
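
A golden dataset does not need special tooling to start with. The sketch below assumes the simplest scoring rule, that the agent's decision must equal the human's, and uses a made-up agent and invented examples; real scoring may need something richer than equality.

```python
from typing import Callable

MIN_GOLDEN_EXAMPLES = 200  # the "200+" real examples from this step


def score_against_golden(golden: list[tuple[str, str]],
                         agent: Callable[[str], str]) -> dict:
    """Score an agent against (input, human_decision) pairs from the real process."""
    if len(golden) < MIN_GOLDEN_EXAMPLES:
        raise ValueError(
            f"only {len(golden)} examples; this node is not ready to evaluate")
    mismatches = []
    for case_input, human_decision in golden:
        agent_decision = agent(case_input)
        if agent_decision != human_decision:
            mismatches.append((case_input, human_decision, agent_decision))
    match_rate = 1 - len(mismatches) / len(golden)
    return {"match_rate": match_rate, "mismatches": mismatches}


def always_approve(case_input: str) -> str:
    """Hypothetical placeholder agent, here only so the sketch runs."""
    return "approve"


# Invented golden set: 150 cases a human approved, 50 a human rejected.
golden = ([(f"case-{i}", "approve") for i in range(150)]
          + [(f"case-{i}", "reject") for i in range(150, 200)])
report = score_against_golden(golden, always_approve)
print(report["match_rate"], len(report["mismatches"]))
```

Because the harness only needs a callable, you can build and run it against a trivial placeholder like this one before any agent code exists.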

7: Shadow mode first. Then cut over.

Run the agent on live inputs in parallel with the humans. Humans keep making the real decisions. You compare outputs - systematically, against your golden dataset and against the human calls on the same inputs.

The edge cases that didn’t show up in your test set will show up here. They always do. Shadow mode is where you find them safely, without consequences, while building the evidence base that earns trust.

Cut over when the error rate threshold is met. Not when the launch date arrives.
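
In code, shadow mode comes down to one rule: the agent's decision is recorded but never acted on. A minimal sketch follows, treating the human call as the reference. The `min_samples` guard is my own addition, and the decision functions are invented.

```python
from typing import Callable


def shadow_step(item: str,
                human_decide: Callable[[str], str],
                agent_decide: Callable[[str], str],
                log: list[dict]) -> str:
    """Process one live input. The human decision is the one that takes effect."""
    human = human_decide(item)
    agent = agent_decide(item)  # computed and logged, never acted on
    log.append({"input": item, "human": human, "agent": agent})
    return human


def ready_to_cut_over(log: list[dict],
                      max_error_rate: float,
                      min_samples: int = 500) -> bool:
    """Cut over on evidence: enough samples and a disagreement rate under threshold."""
    if len(log) < min_samples:
        return False
    disagreements = sum(1 for row in log if row["agent"] != row["human"])
    return disagreements / len(log) <= max_error_rate


# Hypothetical run over ten live inputs; the agent gets one of them wrong.
log: list[dict] = []
for i in range(10):
    shadow_step(f"ticket-{i}",
                human_decide=lambda item: "escalate" if item.endswith("7") else "close",
                agent_decide=lambda item: "close",
                log=log)
print(ready_to_cut_over(log, max_error_rate=0.05, min_samples=5))
```

The rows where the two disagree are the ones worth reading by hand. Some will be agent errors, and some will turn out to be humans deviating from the documented process.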

The question that tells you if you’re ready

Before any of this starts, ask yourself one thing:

If you ran the human process and the agentic process side by side on the same inputs for 90 days, would you have the data to prove the agent is performing at least as well?

If yes - you’ve defined success, you have evaluation infrastructure, you’re ready.

If no - something foundational is missing. You haven’t defined success clearly enough, the data can’t support measurement, or you don’t have a golden dataset to compare against. That’s Evaluation Debt.

The process you want to turn agentic probably has a version that can work. Whether you get there depends on whether you’re willing to answer the hard questions before the tools conversation starts.

Most meetings I’m in, we never get there.

If you’re in the middle of this - mapping a process, arguing about scope, trying to figure out where the human stays in the loop - hit reply. Tell me where it’s stuck. I find these problems genuinely interesting and could share my insights.

If you enjoyed reading this, please share it with your friends. Leave feedback, and tell me what you would like to read more of.

Thanks,
Sandi.

Ask your friends to join.
More valuable content coming your way.


Why the best developers are writing less code than ever

Sun, 22 Feb 2026 21:52:10 GMT

A conversation with Lena Hall, Sr. Director of Developer Relations at Akamai, ex-AWS, ex-Microsoft Research.

Something quietly shifted in the last 12 months.

The senior engineers moving fastest right now - the principal engineers, the architects - are spending 80% of their time writing specs, not code. They’re defining inputs and outputs, mapping component contracts, eliminating ambiguity before an agent ever touches the implementation.

If that makes you uncomfortable, it should. It means the game has changed - and what made you good yesterday may not be what makes you valuable tomorrow.

I sat down with Lena Hall to talk about this. Lena has built distributed systems at Microsoft Research, led developer relations across AWS, and now drives AI infrastructure strategy at Akamai. She’s watched this shift happen in real time.

“Code is becoming like binaries. We don’t manage binaries - they’re generated. Code is heading the same way. The question is: what does that make you?”
- Lena Hall

The Role Is Changing. Here’s What It’s Changing Into.

Developers aren’t becoming obsolete. They’re becoming architects - and the best ones are operating more like CTOs. They own the logic, the system design, the edge cases. They define what needs to be built with enough clarity that an AI agent can execute it reliably.

But here’s where Lena’s technical background adds a layer most people miss: AI agents are non-deterministic by nature. If you don’t control them structurally - with structured outputs, phased execution, and human checkpoints at high-stakes decision points - they will break in ways you can’t predict or explain.

She calls this pragmatic AI: architecture matched to the stakes of the business problem. Low-stakes tasks can tolerate some ambiguity. High-stakes tasks - financial decisions, healthcare workflows, anything with legal exposure - cannot. The expert must be in the loop before the system ships, not after.
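To make "structured outputs" and "human checkpoints" concrete, here is a small Python sketch of the idea, not something from the episode. It assumes the model is asked to return JSON with an action and a confidence field. The field names, the allowed actions, the list of high-stakes domains and the confidence threshold are all assumptions for illustration.

```python
import json

ALLOWED_ACTIONS = {"approve", "reject", "escalate"}       # assumed contract
HIGH_STAKES_DOMAINS = {"finance", "healthcare", "legal"}  # assumed tiering


def parse_decision(raw_output):
    """Validate a model's JSON output against a fixed contract; raise on violation."""
    data = json.loads(raw_output)
    if not isinstance(data, dict):
        raise ValueError("decision must be a JSON object")
    action = data.get("action")
    confidence = data.get("confidence")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        raise ValueError("confidence must be a number between 0 and 1")
    return {"action": action, "confidence": float(confidence)}


def route(raw_output, domain, min_confidence=0.9):
    """Return 'auto' only for valid, confident, low-stakes decisions."""
    try:
        decision = parse_decision(raw_output)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return "human_review"
    if domain in HIGH_STAKES_DOMAINS or decision["confidence"] < min_confidence:
        return "human_review"
    return "auto"


sample = '{"action": "approve", "confidence": 0.95}'
print(route(sample, "marketing"), route(sample, "finance"), route("not json", "marketing"))
```

The design choice worth noticing is that malformed output never reaches the business logic: anything that breaks the contract goes to a person by default.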

What You’ll Take Away From This Episode

Why the Two Generals Problem from distributed systems applies directly to every LLM call you make
The 3-tier framework for deciding how much AI control your use case actually needs
Why excluding domain experts from your AI workflow is the #1 mistake teams make
The one habit that separates developers who ship reliable AI from those who don’t: fix the spec, not the output

WATCH THE FULL EPISODE - AgentBuild Expert Exchange

Whether you’re writing code every day or managing teams that do - this one reframes how you think about where your value actually lives in the AI era.

Thanks, and please leave some comments on the video.
-Sandi.


Storytelling: The SKILL that’s quietly becoming more valuable than your technical chops

Sat, 14 Feb 2026 14:30:38 GMT

I watched one of the sharpest ML engineers I’ve ever worked with get passed over for a lead role last year.

His technical work was genuinely brilliant. The kind of thing that makes other engineers quietly jealous.

Then he presented it to leadership.

Forty-five minutes of architecture diagrams. Precision recall curves. Token-level breakdowns of embedding strategies. Every slide was technically correct... and completely forgettable.

The exec sponsor checked her phone twice. The VP asked one question: “So what does this actually mean for our customers?”

He stumbled. Not because he didn’t know - he absolutely did - but because he had never practised framing it as anything other than a technical achievement.

Someone else got the role. Someone less technically impressive, but who could walk into a room and make a CTO feel something about the work.

I’ve seen this pattern dozens of times now. Brilliant people, invisible impact. And it’s not because they lack skill. It’s because nobody ever told them that the story of the work matters as much as the work itself.

That gap is about to get a lot more expensive.

And you need to pay attention.

This shift is already here

So here’s where it gets interesting.

The Wall Street Journal reported in December 2025 that LinkedIn job posts mentioning “storyteller” doubled in a single year. Not grew a bit. Doubled.

Let that sit for a second.

And it’s not just hiring. Executive mentions of “storytelling” on earnings calls hit 469 in 2025, up from 147 in 2015. That’s not a marketing trend. That’s a boardroom concept now.

So, you may ask, why the spike?

You can probably guess.

AI made content cheap. Abundant, even.

Which made trust and human narrative scarce... and therefore valuable.

One communications CEO nailed it: the flood of AI-generated content created so much distrust that the brands winning right now are the ones that sound most human.

Here’s the weird side-effect if you’re in data or AI: your dashboards, models, and agents aren’t the final product anymore.
The story about them is.

How you actually get better at this

You don’t need to become a novelist. You just need to change how you frame what you’re already doing.

Frame everything as a before and after. Next time you present work, try this:

Before - here’s how decisions were made, or what was broken.
Conflict - here’s the cost of staying like this.
After - here’s what changes if this works.

One slide. Three beats. That’s it. That’s your story. Most people skip the conflict part, by the way - and that’s exactly the bit that makes execs lean forward.

Translate complexity into choices. Instead of “we used model X with technique Y,” try: “We chose this approach because it sacrifices a bit of accuracy for much better latency, which means customers don’t wait.” See the difference? You’re telling a story of trade-offs now. A VP can repeat that in a corridor. They can’t repeat your architecture diagram.
Anchor every number to a human. “3% uplift” is forgettable. “That 3% means 8,000 fewer customers hitting this error screen every month” - that sticks. Whenever you’ve got a metric, ask yourself: what does this number feel like for a real person?

Why bother when you could just get much better at technical stuff?

Fair question. Here’s my honest answer.

AI is eating the production side of our work. Fast.
Code, analysis, first drafts - all getting automated.
What it can’t replace is picking the right problem, reading the room, and crafting a narrative that makes someone with budget authority say “we’re doing this.”

That’s the bit you don’t want to outsource.

Here's what I do.
Every project I work on, I write a five-sentence story about it before I present anything: who it helps, what hurts today, what we're changing, how we'll know it worked, what happens next.

No fancy framework. Just five sentences.
It forces me to find the narrative before I open a slide deck.

Then next time you send a Slack update or a deck, check: is there a clear before and after? Is there one sentence someone could repeat to their boss?

If you can answer that... you’re already ahead of most technical people in the room. Because you got clearer.

And clarity, it turns out, is what actually moves organisations.

Thanks for reading. Tell me in the comments: what do you think about this new skill employers are looking for? How are you preparing for this shift?

Thanks,
Sandi.


Anthropic’s Timeline vs. Your 2006 Database

Sat, 07 Feb 2026 10:51:54 GMT

Dario Amodei, CEO of Anthropic, says software engineers have 6-12 months left.

Meanwhile, I just left a meeting last week where a $1B+ company spent 90 minutes arguing about whether “active customer” means someone who bought in the last 30 days or 90 days.

Different teams.

Different definitions.

Different databases.

These are not the same timeline.

Image Credit: artificialintelligence.co

The Anthropic Reality

At Anthropic, engineers stopped writing code because their stack was built for AI from day one. Clean data. Modern architecture. No legacy anything.

The Enterprise Reality

At most companies, the Head of Data is still explaining why you can’t just “put everything in a vector database” when nobody agrees on what an active customer is.

This isn’t about AI capability. It’s about infrastructure debt that’s 15 - 30 years deep.

The Real Gap

Most of my time with customers these days is spent talking about AI readiness.

You know what blocks them? Not models. Not talent. Not budget.

It’s that nobody can answer basic questions:

Where is the source of truth for customer data?
Can we access it in real-time or only batch?
Do we have lineage? Do we have versioning?
Can we trace decisions back to their data sources?

AI-native companies designed from scratch for these questions.

Everyone else is retrofitting.

The Opportunity in the Gap

Here’s what’s interesting about this moment.

AI companies built for a world that doesn’t exist yet.

Enterprises are still operating in the world that does.

Someone has to bridge that gap.

And right now, AI companies are realizing they can’t do it alone. Anthropic can’t retrofit your 2006 database. OpenAI can’t untangle your customer data across 12 systems.

This is where the real opportunity is.

Not in building better models.

In building the infrastructure that lets models actually work.

If you know:

How to design cloud infrastructure
How to build data platforms
How industry-specific workflows actually operate
How to translate technical requirements to business outcomes

You’re not behind. You’re exactly where the market needs you.

But here’s what you need to learn - and learn fast:

The gap between what models can do and what enterprises can actually deploy.

Because that gap is the entire business for the next 5 years.

Most people think they need to learn prompt engineering or RAG architectures.

What they actually need to learn is why a large enterprise can’t answer “what’s this customer’s balance?” without a nightly batch job.

And how to fix that before the AI even shows up.

The window’s still open. But it’s closing fast.

Not because AI is getting harder. Because the people who understand both worlds - AI capability AND enterprise reality - are getting snatched up.

The question isn’t whether you can catch up to Anthropic.

The question is whether you can help enterprises catch up to what AI requires.

That’s the real opportunity.

Next Week

I am releasing a plan for you to prepare for this opportunity.

Make sure you don’t miss that.


When Your AI is Quietly Failing

Sat, 31 Jan 2026 14:30:24 GMT

A few months ago I met a VP of Engineering at a SaaS startup building document processing solutions. He said, “We shipped our AI six months ago, and my team spends most of its time fire-fighting issues - I am not sure we know what’s wrong with it.”

They had no measurement infrastructure. No test cases. No way to trace decisions. They’d celebrated the launch, moved the team to the next project, and were now spending most of their time resolving issues - patching and fixing.

He’s not alone. I have similar discussions with many companies that shipped their AI features under pressure - from boards, from investors, from competition.

I keep seeing this pattern

Three times in the last quarter, I’ve worked with organizations dealing with the same crisis:

Scenario 1: This company deployed document processing AI to 10 enterprise customers. Six months in, their largest customer threatened to cancel. The AI was extracting wrong data from contracts. The team had no way to see why.

Scenario 2: A fintech launched an insurance documentation assistant. Their customers complained it was “getting slower” and “less accurate.” Their team couldn’t verify either claim - they’d never established baselines.

Scenario 3: A retail bank deployed a chatbot handling internal claim disputes. Support tickets about “wrong AI answers” were climbing. Nobody could trace which policy the AI used or why it made specific decisions.

Here’s the common thread I noticed:

All three had deployed without building evaluation infrastructure.
They’d built the AI, but not the system to know if the AI was working.

What Actually Happens in These Meetings

I sit in a conference room with engineering, product, and business leaders. I ask:

“What’s your accuracy?”

“X%”

“What’s it costing per query?”

“Infrastructure costs are $X, but we don’t track per-query.”

“Do you have test cases?”

“We tested it before launch...”

“Can you show me traces of what went wrong?”

Silence.

This is what I call Evaluation Debt.

You deployed a system without building the measurement infrastructure to operate it. And now you’re paying interest - in firefighting, guessing, and eroding stakeholder trust.

The Recovery Framework I Use

Here’s what most advice gets wrong: it assumes you’re starting from scratch. But you’re not. Many AI POCs made it to production without ever being production-ready. They need a recovery framework, not a startup guide.

I’ve developed a four-phase Recovery Pathway that rescues these systems without rebuilding from scratch:

Define what success should have been
Build measurement infrastructure retroactively
Diagnose with data, not guesses
Fix and validate systematically
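As one way to picture "build measurement infrastructure retroactively", the Python sketch below wraps an existing function with a tracing decorator instead of rewriting it. It is an illustration under assumptions, not the Recovery Pathway implementation: the in-memory TRACE_LOG stands in for a durable store, the flat per-call cost is a placeholder, and extract_contract_value is a hypothetical stand-in for an existing AI call.

```python
import functools
import time
import uuid

TRACE_LOG = []  # in-memory stand-in; a real system would write to durable storage


def traced(cost_per_call_usd):
    """Wrap an existing step so every call records inputs, output, latency and cost."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "trace_id": str(uuid.uuid4()),
                "step": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "cost_usd": cost_per_call_usd,  # assumed flat cost per call
            }
            start = time.perf_counter()
            try:
                record["output"] = fn(*args, **kwargs)
                record["status"] = "ok"
                return record["output"]
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so failed calls are traced too.
                record["latency_s"] = time.perf_counter() - start
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@traced(cost_per_call_usd=0.004)  # hypothetical figure
def extract_contract_value(document_text):
    # Stand-in for the existing AI extraction step.
    return document_text.split("$")[-1].strip()


extract_contract_value("Total contract value: $120,000")
print(TRACE_LOG[-1]["step"], TRACE_LOG[-1]["status"], TRACE_LOG[-1]["output"])
```

Because the decorator sits outside the function, the existing code path stays untouched while per-query cost and failure traces start accumulating.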

That SaaS company? I took them from 73% accuracy and customer threats to 96% accuracy and $1.4 million in annual savings.

Not by switching models.

By implementing evaluation infrastructure and working systematically.

Why This Matters Now

According to recent IDC research, enterprises are collectively spending $154 billion on AI initiatives in 2024. But McKinsey data shows that only 11% of organizations have achieved significant financial returns from their AI investments.

The gap between investment and return isn’t a capability problem.

It’s a measurement problem.

You can’t fix what you can’t measure.

And you can’t defend an investment you can’t prove is working.

Watch the Recovery Pathway Framework Video

I just recorded a complete walkthrough of the Recovery Pathway framework, including a real case study where we rescued a failing AI system.

👉 I will be bringing more practical content like this - please subscribe to my channel if you want to stay updated. I share clips and short formats so you can learn something new every day.

You’ll see:

The exact diagnostic process that reveals where failures are happening
How to implement tracing retroactively without rebuilding
The week-by-week action plan to go from crisis to recovery
Real numbers: $4.20 per document down to $1.80, 47 complaints per month down to 3

This isn’t theory. This is the actual process I use when organizations call me to rescue production AI systems.

If you’re dealing with an AI system that shipped but isn’t delivering the value you promised - this framework will show you the way out.

And if you know someone firefighting a struggling AI deployment, share this with them. Recovery is possible. But it requires working backward with discipline.

👉 BONUS: I’ve created a Recovery Pathway checklist with the workshop agenda, tracing implementation guide, and diagnostic framework.

Get in touch if you have questions.

Found this useful? Ask your friends to join.
We have so much planned for the community - can’t wait to share more soon.


You Don’t Need To Code To Win In AI

Sat, 24 Jan 2026 15:01:57 GMT

“I need to learn Python first.”

I hear this constantly. Smart people - product managers, operations directors, strategists - convinced they’re locked out because they can’t code.

And I get it. Every conference is engineers talking models. Every job posting says “Python required.” The whole industry makes you feel like you need a CS degree to have opinions.

But here’s what nobody says out loud: the people making AI actually work? Half of them can’t code.

Let me tell you about Sarah.

She runs customer success at a fintech. No engineering background. Can’t code.

Her team’s AI chatbot was escalating 40% of conversations to humans. Engineering loved their metrics - great uptime, low latency, all good.

Sarah just... read the transcripts.

Customers were asking “What’s my balance?” five different ways. The AI only recognized one phrasing. Everything else got escalated.

She made a list. Sent it to engineering. They fixed the prompts.

Escalations dropped to 12%.

No Python. Just pattern recognition from years of reading support tickets.

Or Marcus.

Finance guy reviewing an AI project that would auto-approve transactions.

One question: “What’s our liability if this approves something it shouldn’t?”

Engineering hadn’t thought about it. Legal wasn’t in the room. No audit trail, no rollback, nothing.

Project paused. Controls added. Compliance disaster avoided.

Marcus can’t tell you the difference between GPT-4 and Claude. Doesn’t matter. He knows business risk.

Here’s the pattern:

Companies are betting millions on AI based on... what?

Engineering enthusiasm. Vendor pitches. FOMO.

Six months later they’re wondering why it’s not working.

Usually because nobody asked:

“Which actual problem are we solving?”
“How will we know if this works?”
“What happens when it’s wrong?”

These aren’t technical questions. They’re judgment questions.

And you’ve been building that judgment your whole career.

What companies desperately need:

Someone who can translate. You know your business, your customers, what breaks in production. That’s incredibly valuable for evaluating AI - you just need confidence to use it.

Someone who can smell bullshit. When a vendor says “95% accurate,” you ask “at what? measured how?” That’s not coding. That’s skepticism.

Someone who thinks in systems. AI doesn’t exist alone. It connects to databases, changes workflows, affects people. Understanding those ripples? That’s experience, not Python.

What you actually need to learn:

The boundaries. AI is great at patterns, terrible at novel reasoning. It summarizes well but hallucinates confidently. Needs good data or amplifies garbage.

The right questions:

“How are you evaluating this?”
“What error rate are we comfortable with?”
“Who’s responsible when it’s wrong?”
“How do we know if it degrades?”

That’s not rocket…errrm, computer science.

How to spot vaporware. The AI space is full of companies slapping “AI-powered” on regular features. Your job isn’t building models. It’s protecting your company from expensive mistakes.

If you’re waiting to “get technical enough”... stop.

Pick one AI project at your company. Ask three questions:

“How are we measuring if this works?”
“What happens when it’s wrong?”
“How will we know if it stops working?”

Can’t answer clearly? You just found where you’re needed.

No Python required.

Just business sense and courage to ask uncomfortable questions.

The future of AI isn’t just technical.

It’s people who bridge what’s possible and what’s useful.

Between demos and systems that work when everything’s on fire.

Between vendor promises and reality.

That’s you.

You don’t need permission. Don’t need a bootcamp.

Just show up and ask the questions engineers aren’t asking.

They’re making it work. You make sure it’s worth working on.

What’s stopping you from asking those questions?


The 5 Levels of Context: Moving From "Telepathy" to "Glassbox" AI

Sun, 18 Jan 2026 14:03:13 GMT

Hi everyone,

In this week’s newsletter, I’m sharing a masterclass interview with Denis Rothman, author of Context Engineering for Multi-Agent Systems. If you are tired of AI systems that work in demos but fail in production, the video below is essential viewing.

We often treat AI like an oracle, sending vague prompts and hoping for the best. Denis argues that the days of “writing functions” are fading, and we must move from simple prompting to Context Engineering to build production-ready systems.

Here are the key takeaways from our deep dive into the “glassbox” architecture of future AI.

1. The 5 Levels of Context Sophistication

Most users are stuck at Level 1, where they ask vague questions like “What would you like to drink?” and receive random, probabilistic answers. Rothman outlines a hierarchy to eliminate this entropy:

Level 1 (Zero Context): Random guessing based on training data.
Level 2 (Linear Context): Adding basic details (e.g., “It is 6 p.m.”).
Level 3 (Goal-Oriented): Defining what you are trying to achieve.
Level 4 (Role-Based): Assigning explicit roles and relationships.
Level 5 (Semantic Blueprints): This is the engineering level. It involves decomposing actions into semantic components - treating the LLM less like a person and more like a database to be queried with specific keys and values.
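To give a feel for what a Level 5 "semantic blueprint" might look like in practice, here is a small Python sketch. It is my own illustration of the keys-and-values idea, not Rothman's format: the SemanticBlueprint class, its field names and the rendered layout are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticBlueprint:
    """A request broken into explicit, named components instead of free text."""
    role: str
    goal: str
    context: dict = field(default_factory=dict)
    output_format: str = "plain text"

    def render(self):
        """Turn the blueprint into a deterministic, line-per-key prompt."""
        lines = [f"ROLE: {self.role}", f"GOAL: {self.goal}"]
        lines += [f"CONTEXT.{key}: {value}" for key, value in sorted(self.context.items())]
        lines.append(f"OUTPUT_FORMAT: {self.output_format}")
        return "\n".join(lines)


blueprint = SemanticBlueprint(
    role="bartender",
    goal="recommend one drink",
    context={"time": "6 p.m.", "customer_preference": "non-alcoholic"},
    output_format="one sentence",
)
prompt = blueprint.render()
print(prompt)
```

Because the same blueprint always renders to the same text, it can be versioned, diffed and tested like any other specification.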

2. The Magic Word is “Specifications”

During the discussion, we uncovered that the essence of successful AI interaction isn’t “prompting” - it is providing specifications.

Just as you wouldn’t tell a flight attendant “I want a flight” without specifying a destination, you cannot expect an LLM to function without structured constraints. Rothman advocates using the Model Context Protocol (MCP). By standardising how we speak to agents (using a universal structure), we can create systems that are testable, reusable, and scalable.

3. The Dual RAG Architecture

One of the most powerful concepts Rothman introduces is Dual RAG. Standard Retrieval-Augmented Generation often mixes everything together. A production-ready system separates them into:

• The Knowledge Base: The factual data (The “What”).

• The Context Library: A repository of semantic blueprints and instructions (The “How”).

This prevents engineers from rewriting the same complex prompts for months. Instead, you vectorize your best instructions and store them, allowing agents to retrieve how to do a task before they retrieve the data to do it with.
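The "retrieve the how, then the what" flow can be sketched in a few lines of Python. This is a toy under loud assumptions: the two stores are plain dictionaries, and a word-overlap count stands in for the vector similarity a real system would use. The store contents and function names are invented for illustration.

```python
def overlap_score(query, text):
    """Toy relevance: count of shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, store):
    """Return the key of the best-matching entry in a {key: text} store."""
    return max(store, key=lambda key: overlap_score(query, store[key]))


# The "How": reusable instructions. The "What": factual data. Both are invented examples.
CONTEXT_LIBRARY = {
    "summarize_contract": "how to summarize a contract - list parties then term then payment",
    "draft_reply": "how to draft a customer reply - acknowledge then answer then next step",
}
KNOWLEDGE_BASE = {
    "contract_acme": "contract between Acme and Globex with a 24 month term and net 30 payment",
    "faq_billing": "billing faq - invoices are issued monthly",
}


def build_prompt(task):
    how = CONTEXT_LIBRARY[retrieve(task, CONTEXT_LIBRARY)]  # instructions first
    what = KNOWLEDGE_BASE[retrieve(task, KNOWLEDGE_BASE)]   # then the data
    return f"INSTRUCTIONS: {how}\nDATA: {what}"


prompt = build_prompt("summarize the Acme contract")
print(prompt)
```

Keeping the two stores separate means an improved instruction can be swapped in once and picked up by every task that retrieves it.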

4. The “Glassbox” and Traceability

For enterprise adoption, “black box” AI is a liability. Rothman insists on a Glassbox architecture comprising a Planner, Executor, and, most importantly, a Tracer.

Traceability is crucial for legal and copyright reasons. If you cannot prove that your specific prompt design generated a marketing campaign, you may not own the copyright. A context engine allows you to trace the entire lineage of an output, proving human design in the loop.
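A Planner, Executor and Tracer can be very small and still make the lineage of an output inspectable. The Python sketch below is a toy illustration of that split, not Rothman's engine: the fixed plan, the step names and the handler outputs are all hypothetical.

```python
class Tracer:
    """Append-only record of what was planned and what each step produced."""
    def __init__(self):
        self.events = []

    def record(self, stage, detail):
        self.events.append({"seq": len(self.events) + 1, "stage": stage, "detail": detail})


def plan(goal, tracer):
    # Hypothetical fixed plan; a real planner would derive the steps from the goal.
    steps = ["retrieve_instructions", "retrieve_data", "generate_output"]
    tracer.record("planner", {"goal": goal, "steps": steps})
    return steps


def execute(steps, handlers, tracer):
    """Run each step in order, passing accumulated state and tracing every result."""
    state = {}
    for step in steps:
        state[step] = handlers[step](state)
        tracer.record("executor", {"step": step, "result": state[step]})
    return state


handlers = {
    "retrieve_instructions": lambda state: "blueprint: campaign_brief_v1",
    "retrieve_data": lambda state: "product facts sheet",
    "generate_output": lambda state: (
        f"draft built from {state['retrieve_instructions']} + {state['retrieve_data']}"
    ),
}

tracer = Tracer()
final_state = execute(plan("write a marketing campaign", tracer), handlers, tracer)
for event in tracer.events:
    print(event["seq"], event["stage"], event["detail"])
```

Reading the trace top to bottom shows which instructions and which data fed the final output, which is the lineage the Tracer exists to preserve.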

5. Fix the Organization, Not Just the Code

Finally, a word of warning: Do not try to use AI to fix a disorganized company. If you automate a broken process, you just get broken results faster.

Rothman advises starting with “quick wins” by finding a domain expert who is overworked. Don’t build agents to replace them; build agents to reduce their 12-hour workday to four hours. This builds a reputation as a helper rather than a “killer” of jobs, ensuring organizational buy-in.

Watch the full video below to learn how to implement these concepts.

For those who want to go deeper, Denis Rothman is running a workshop on January 24th where you can build a universal context engine from scratch.


The Production AI Manifesto

Fri, 02 Jan 2026 12:30:58 GMT

Introduction

Happy New Year, everyone. This is 2026.

AI capabilities have never been more powerful. Foundation models can reason, generate, and act. Every vendor promises transformation. Every conference declares a new era with a bunch of announcements.

And yet.

Beneath these fields of progress lies an enormous graveyard. It holds the remains of the AI projects that impressed in demos, were celebrated at launch, and were abandoned within a year. No one could satisfactorily explain why they failed. No one could say whether they had ever really worked.

We have watched this cycle repeat across industries, company sizes, and technology stacks. The pattern is not random. The failures are not inevitable.

They are the result of building with a “tools-first” mindset.

This manifesto declares a different path. Not a new framework, not a new tool, not a new model. A discipline. A set of commitments about how production AI must be built.

I offer these principles to every practitioner who is tired of demos that don’t become products - and ready to build AI that survives contact with reality.

If you’re reading this, you might like to read:
2025: Lessons From The Trenches
Stop Building AI Systems You Can’t Measure
Artificial Intelligence ➛ Profitable Intelligence


The Principles

I. Most AI initiatives fail not from lack of capability, but from lack of clarity about what success looks like.

II. You cannot automate a decision that was never defined. If the logic lives only in people’s heads, AI will not extract it. It will invent something worse.

III. Data built for humans breaks when machines consume it. Reports and dashboards tolerate ambiguity. AI amplifies it.

IV. Intelligence on top of inconsistency produces confident nonsense. Confident nonsense is more dangerous than no answer at all.

V. Three debts block the path to production: Data Debt, Decision Debt, Evaluation Debt. Your data is not ready, your logic was never specified, you cannot tell if the system works. These are not technical problems. They are clarity problems. No model will solve them for you.

VI. Evaluation is not a final checkbox. It is the architecture that determines whether your system can learn, improve, and survive.

VII. Define success before you build. Design measurement before you design systems. Choose tools only after you know how you will judge them.

VIII. If you cannot measure it, you cannot improve it. If you cannot prove it works, you cannot defend the investment.

IX. Observability is not overhead. It is the difference between debugging in hours and debugging never. Every decision your AI makes must be traceable to the data, logic, and context that produced it.

X. AI does not transform organizations. It reveals how untransformed they already are. Every gap in data, every inconsistency in decisions, every missing feedback loop - AI makes them visible and makes them hurt.

XI. The teams that succeed do not have bigger budgets or better models. They have clearer definitions of success and the infrastructure to measure it.

XII. We reject the theater of impressive demos. We commit to building AI that works - in production, under pressure, at scale, over time.

Conclusion

This manifesto is a line in the sand.

On one side: AI as spectacle. Measured by applause in meeting rooms. Validated by executive excitement. Declared successful at launch and never examined again.
On the other side: AI as engineering discipline. Measured by outcomes. Validated by evidence. Improved continuously because the architecture demands it.

We stand on the second side.

In the weeks and months ahead, I will publish the frameworks, patterns, and hard-won lessons that make this real. How to assess whether your data is ready. How to implement evaluation-first development. How to build systems you can observe, debug, and improve.

This is not theory. This is the work.

Join us if you are building AI that has to survive contact with reality.

Thanks,
Sandi.


2025: Lessons From The Trenches

Sat, 27 Dec 2025 15:30:24 GMT

I started 2025 thinking I understood production AI.

Eighteen years in data & ML engineering. Time spent on some of the most strategic, cutting-edge projects with F500 customers. Leading conversations with enterprises about data literacy, governance, “being data-driven”, and now AI adoption. I’d seen enough technology cycles to know the patterns.

But this year humbled me.

I watched the gap widen between teams who shipped and teams who stalled.

And the reasons weren’t what I expected.

These lessons didn’t come from research reports. They came from the conversations with leaders and their teams working with AI. From the data leader who finally said out loud what everyone was thinking. From the community discussions in AgentBuild where practitioners shared what was actually happening - not what their LinkedIn posts claimed.

Here’s what I actually learned.

1. Everyone was solving the wrong problem first

I had a call in March that changed how I think about AI projects.

A team had spent three months building an incredibly sophisticated multi-agent system. Beautiful architecture. Clean abstractions. The kind of thing that gets applause in a demo.

Then I asked: “How do you know if it’s working?”

Silence.

They’d built the entire system without defining what success looked like.
No evaluation criteria.
No baseline metrics.
No way to know if Version 2 was actually better than Version 1.

I started asking this question in every conversation after that. Maybe 20% of teams had a good answer. The rest were flying blind, hoping they’d know “good” when they saw it.

This is what pushed me toward what I now call the Reverse Strategy Framework - starting with evaluation before architecture, defining success before selecting tools. It feels backwards. But every team I’ve seen succeed this year did it this way, whether they had a name for it or not.
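Here is a minimal Python sketch of what "knowing if Version 2 is better than Version 1" can mean once an evaluation set exists. It assumes both versions can be called as functions and that exact match against an expected answer is an acceptable metric. The lookup-table "versions" are toy stand-ins.

```python
def compare_versions(v1, v2, eval_set):
    """Score two versions on the same cases and list what v2 broke that v1 got right."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    v1_hits = [v1(question) == expected for question, expected in eval_set]
    v2_hits = [v2(question) == expected for question, expected in eval_set]
    regressions = [
        question
        for (question, _), hit1, hit2 in zip(eval_set, v1_hits, v2_hits)
        if hit1 and not hit2
    ]
    return {
        "v1_accuracy": sum(v1_hits) / len(eval_set),
        "v2_accuracy": sum(v2_hits) / len(eval_set),
        "regressions": regressions,
    }


# Toy stand-ins for two system versions: simple lookup tables.
v1_answers = {"q1": "a", "q2": "b", "q3": "x"}
v2_answers = {"q1": "a", "q2": "y", "q3": "c"}
eval_set = [("q1", "a"), ("q2", "b"), ("q3", "c")]

report = compare_versions(v1_answers.get, v2_answers.get, eval_set)
print(report)
```

In this toy run both versions score the same accuracy, yet Version 2 breaks a case Version 1 handled, which is exactly what a single headline number hides.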

2. “We have a data problem” became the most honest sentence in enterprise AI

I used to hear teams blame models. Blame frameworks. Blame vendors.

This year, I finally started hearing the truth: “Our data isn’t ready.”

One conversation sticks with me. A data leader at a large financial services company - someone I respect enormously - said: “Sandi, we’ve been talking about data quality for fifteen years. But AI is the first thing that made it impossible to ignore. Every failure traces back to the same place.”

That honesty is spreading. In community discussions, in customer calls, in the quiet admissions after the official meeting ends. The facade is cracking.

Here’s what I’ve come to believe:
Most AI failures aren’t AI failures. They’re data failures that AI made visible.

3. The teams that shipped weren’t the smartest - they were the most boring

I talked to a lot of teams this year who were building impressive things. Complex agent architectures. Novel approaches. Cutting-edge frameworks.

Most of them are still in pilot.

The teams that actually made it to production? Their architectures were almost disappointingly simple. Single-agent systems. Straightforward retrieval patterns. Minimal orchestration complexity.

One engineering lead told me: “We had to kill our egos. The sophisticated version was more fun to build. But the boring version was what we could actually operate.”

I’ve started using this as a diagnostic. When someone shows me an architecture diagram and I’m impressed, I get worried. When I’m slightly underwhelmed, I get optimistic.

4. Human oversight isn’t a training wheel - it’s a feature

Early in the year, I was in a room where someone said “the goal is to remove humans from the loop entirely.”

I’ve heard variations of that statement dozens of times since. And I’ve watched those projects struggle.

The teams that succeeded took a different view. They designed human touchpoints into their systems - not as temporary scaffolding to remove later, but as permanent architecture.

A compliance lead at a healthcare company put it perfectly: “The question isn’t when we can trust the AI to work alone. The question is where humans add the most value in the process. That’s not a limitation. That’s design.”

I’ve stopped seeing human-in-the-loop as a constraint. It’s a capability.

5. Trust compounds. So does distrust.

I had coffee with a CTO who’d just pulled an AI initiative that was technically working fine.

Why? The business didn’t trust it. One bad output early on - something the team had long since fixed - had poisoned the well. Users had stopped engaging. Stakeholders had lost confidence. The perception problem became unsolvable.

“We actually fixed the accuracy issue in week two,” she told me. “But it didn’t matter. We’d already lost them.”

This is the lesson I wish I’d understood earlier in my career: Trust isn’t something you earn once. It’s something you have to build from the first interaction. And once it’s gone, technical improvements don’t bring it back.

Every team I’ve seen scale successfully invested in trust infrastructure - observability, explainability, guardrails - before they invested in features. The ones who added it later were always playing catch-up.

6. The stack changed faster than anyone could learn it

I stopped counting the number of times someone told me they were “standardizing” on a framework, only to be evaluating alternatives three months later.

This isn’t anyone’s fault. The landscape genuinely moved that fast. New models. New capabilities. New frameworks. Best practices that were obsolete by the time they were documented.

One architect described it as “building on quicksand.” You make a decision, you start building, and then the ground shifts underneath you.

The teams that handled this best didn’t try to pick the “right” stack. They built for replaceability. Modular architectures where components could be swapped without rebuilding everything. They accepted that today’s choice wasn’t permanent and designed accordingly.
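To make “built for replaceability” concrete, here is a minimal sketch, assuming the rest of the system depends only on a thin interface and each vendor or framework sits behind an adapter. `TextModel`, `EchoModel`, `UppercaseModel`, and `answer` are hypothetical names for illustration, and the two stand-in models do no real inference.

```python
from typing import Protocol


class TextModel(Protocol):
    """Hypothetical interface: the only thing application code depends on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for one provider's adapter (no real API call is made)."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """Stand-in for a replacement provider exposing the same interface."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def answer(question: str, model: TextModel) -> str:
    # Application logic sees only the interface, so swapping the model
    # is a one-argument change rather than a rebuild.
    return model.complete(question).strip()


if __name__ == "__main__":
    print(answer("hello", EchoModel()))
    print(answer("hello", UppercaseModel()))
```

The point of the design is that `answer` never imports a vendor SDK. When the ground shifts, only the adapter changes.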

7. ROI timelines were a fantasy

I sat in a lot of planning conversations this year where someone projected AI ROI at 6-12 months.

Every single time, I watched experienced leaders in the room go quiet. They knew. But often, they didn’t push back.

Here’s the reality I observed: Teams that expected quick returns panicked when they didn’t materialize. They cut projects that needed another year to mature. They declared failure on initiatives that were actually on track - just on a longer track than the spreadsheet assumed.

The organizations that succeeded set honest expectations upfront. They treated AI initiatives like infrastructure investments, not SaaS subscriptions. They planned for two to three years, not two to three quarters.

I’ve started asking a new question in early conversations: “What happens if this takes twice as long as you expect?” The answer tells me everything about whether the project will survive.

8. Workflow redesign was the actual unlock

This one took me a while to see clearly.

I’d watch teams add AI to their existing processes. Same workflows, same handoffs, same bottlenecks - just with an AI component inserted somewhere in the middle.

Then I’d watch teams that rethought the entire workflow around AI capabilities.

They didn’t ask “where can we add AI?” They asked “what would this process look like if we designed it from scratch today?”

The results weren’t even close.

The first approach gave incremental improvements. The second approach gave transformation. Same technology. Completely different outcomes.

I’ve become convinced that most AI value isn’t captured by better models or better tools. It’s captured by better process design.

9. The learning loop is everything

Late in the year, I started noticing something about the teams that were genuinely scaling.

Their Day 100 systems were dramatically better than their Day 1 systems. Not because they’d rebuilt them - but because they’d designed them to learn.

Every interaction generated feedback. Every failure got analyzed. Every edge case became training data for the next version. They’d built loops, not just pipelines.

One team lead described it as “compounding interest, but for AI quality.” The teams that started this early were accelerating away from everyone else.

This shifted how I think about initial deployments. Version 1 isn’t about being good. Version 1 is about being learnable.

10. The community taught me more than any vendor

I need to end with this one, because it’s been a game-changer.

The most valuable insights I got this year didn’t come from vendor briefings or analyst reports. They came from practitioners in the AgentBuild community sharing with me what was actually happening in their work.

The engineer who admitted his “successful” deployment was held together with duct tape. The data scientist who explained why her evaluation framework failed and what she built next. The architect who shared his team’s post-mortem after pulling a production agent.

That kind of honesty is rare. And it’s worth more than any polished case study.

If I learned one thing this year, it’s that the gap between AI marketing and AI reality is vast. The people closing that gap aren’t doing it alone. They’re learning from each other.

Where this leaves me for 2026

I’m heading into next year with more conviction and more humility than I started 2025 with.

Conviction that the evaluation-first approach works. That trust infrastructure matters more than model selection. That simple, learnable systems beat complex, static ones.

Humility that I’m still figuring this out. That the landscape will change faster than my assumptions. That the best insights will come from practitioners doing the work, not observers commenting on it.

If you’re navigating this same terrain, I’d genuinely love to hear what you’ve learned. Reply to this email. Share your own lessons. Push back on mine.

This stuff is too important to figure out alone.

What did you learn this year? Hit reply - I read every response and will compile the best lessons into a follow-up piece.
Go deeper in the community: Tell me what you want to read more of. Tell me what you need help with. Reach out and ask.
“If you don’t ask, you don’t get.”
Working through this for your team? I’m opening some conversation slots in Q1 for teams moving from experimentation to production. Reply to this email if you want to talk.


Happy New Year, everyone.

Thank you for being part of this journey. Whether you’ve been reading since day one or just joined us, I’m grateful you’re here.

Wishing you and your family a healthy, peaceful, and joyful 2026. May the year ahead bring you closer to the people and things that matter most.

See you on the other side.

Cheers,
Sandi.

Found this useful? Ask your friends to join.
We have so much planned for the community - can’t wait to share more soon.

Share agentbuild.ai

2025: The Year AI Grew Up And What That Means For Us

Sun, 21 Dec 2025 14:00:16 GMT

Hey AgentBuilders,

Hope everyone is wrapping up the year with a blast.

I am excited to bring new things for you in 2026 - new formats, new engagements, new opportunities - more on that later.

What a year this has been in AI.

2025 will be remembered as the year AI stopped being a demo and started being infrastructure. Not because of some breakthrough model drop, but because the questions changed from “What can AI do?” to “What does AI actually cost, and can we trust it?”

If you’ve been part of this community, you know we’ve been asking those questions all along.

You should get off the Model release hamster-wheel

2025 gave us permanent model season. GPT-5 variants, Claude 4.x, Gemini 3, Grok updates, Llama 4, Qwen 3, DeepSeek – the release calendar was relentless. Every few weeks, another “state of the art” benchmark screenshot on LinkedIn.

But here’s what actually happened: at the frontier, models converged. Different models won different benchmarks – one best at reasoning, another at coding, another at multimodal tasks – but no single runaway winner emerged. The gap between “best” and “second best” became noise. You won’t believe how much time I spend answering questions about model capabilities and fitness for specific use-cases.

This is exactly why I talk about defining your success metrics before picking tools. Because if you’re chasing the latest model release, you’re solving the wrong problem. The question isn’t “which model is best?” – it’s “which model combination solves my specific use case at acceptable cost and latency?”

This is the takeaway: Stop benchmarking models in isolation. Start building evaluation frameworks that test YOUR workflows, YOUR data, YOUR edge cases. The enterprises that won in 2025 aren’t the ones with access to the newest model – they’re the ones who knew how to route workloads intelligently across multiple models.

Multi-model Strategy

2025 forced something we’ve been teaching in AgentBuild Circles: you need a multi-model strategy.

Enterprises stopped betting everything on one provider and started orchestrating across several models based on cost, latency, and regulatory requirements. Some queries went to expensive frontier models, others to cheap specialized models, some to open-weights running on-prem for sensitive data.

Our Cohort 1 members learned this the hard way during their 6-week sprints. They discovered that:

Agent workflows rarely need frontier models for every step
Routing logic matters more than raw model capability
Evaluation isn’t a one-time check – it’s continuous monitoring across your model fleet
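A rough sketch of what rule-based routing could look like, assuming three model tiers and two query attributes. The tier names, the `Query` fields, and the `route` function are hypothetical placeholders, not real model identifiers or anyone's production logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    text: str
    sensitive: bool = False        # e.g. contains regulated customer data
    needs_reasoning: bool = False  # e.g. a multi-step planning task


# Hypothetical tiers; placeholders rather than real model names.
ON_PREM = "open-weights-on-prem"
CHEAP = "small-specialized"
FRONTIER = "frontier"


def route(query: Query) -> str:
    """Pick a model tier using simple, auditable rules."""
    if query.sensitive:
        return ON_PREM   # regulatory constraints win over everything else
    if query.needs_reasoning:
        return FRONTIER  # pay for capability only when the step needs it
    return CHEAP         # default to the cheapest adequate option


if __name__ == "__main__":
    print(route(Query("classify this ticket")))
    print(route(Query("plan a migration", needs_reasoning=True)))
```

Real routers usually weigh latency and cost budgets as well. The ordering of the checks matters here because a sensitive query stays on-prem even when it needs heavy reasoning.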

The tooling for orchestration, routing, and evaluation quietly became one of the most competitive layers in the stack. This is where the real differentiation happens now.

The economics shifted and created opportunities

2025 saw an AI price war. Token prices crashed. Inference costs plummeted. What cost $100 in early 2024 cost $10 by late 2025.

Great for users. Terrible for anyone trying to compete on raw API access.

But here’s the opportunity: defensibility moved up the stack. The winners aren’t competing on model access – they’re competing on evaluation quality, workflow integration, and domain-specific data.

This is why open-source had its best year. Llama 4, Qwen 3, DeepSeek variants, and the Mistral families reached “good enough” performance for most enterprise tasks, especially with retrieval and domain tuning. Even when companies ran these through managed services, having strong open options changed the power dynamic with proprietary vendors.

For builders: Your competitive advantage isn’t which API key you have. It’s how well you’ve instrumented your system, how fast you can evaluate new models, and whether you’ve built evaluation into your architecture from day one.

Agents became real (but Supervised)

Agents moved from demos to production. Not fully autonomous AI colleagues – more like heavily supervised macros that can take actions: updating records, triggering workflows, interacting with other tools.

This is exactly the territory AgentBuild explores. We knew the bottleneck wasn’t “can the model do it?” but “can the organization trust it?”

That’s why governance, approvals, and audit trails dominated enterprise conversations in 2025.

The Reverse Strategy Framework starts with defining success metrics and evaluation criteria precisely because of this. You can’t deploy agents at scale without knowing how to measure when they’re working and catch when they’re not.

The HYPE corrected mildly

2025 brought a reality check. The grandest claims – fully automated knowledge work, instant abundance – didn’t materialize.

But AI embedded itself deeply into search, office software, customer service, marketing, and coding.

The question evolved from “Is AI real?” to “What actually works, and at what cost?”

Investors and executives got more skeptical of vague AI pitches. But they stayed convinced it’s a long-term growth driver. The bar just got higher.

This is our opportunity. While others pivot to the next hype cycle, we focus on production readiness, evaluation frameworks, and systems that actually ship value.

What this means for 2026

If 2023-24 were “wow, this is possible” and 2025 was “what actually works?”, then 2026 is “who can build this into durable products with healthy economics?”

The winners won’t be the ones with the best model access. They’ll be the ones who:

Built evaluation-first architectures
Mastered multi-model orchestration
Built reliable data infrastructure
Focused on workflow integration over raw capability
Measured success in business outcomes, not benchmark scores

2026 is our year. Not because we have access to better models, but because we understand building the infrastructure layer that actually matters: evaluation, orchestration, and production-grade thinking.

I am excited!

Let’s build.
- Sandi

Stop Building AI Systems You Can’t Measure

Sat, 13 Dec 2025 14:03:21 GMT

Picture this: Your team just demoed a shiny new AI chatbot. Leadership is nodding. The technology stack sounds impressive - GPT, LangChain, Pinecone. Three months of work, and it handles those carefully selected demo questions beautifully.

Then someone asks: “How do we know it’s actually working?”

Silence.

This is the trap 95% of AI teams fall into. They pick tools first, build something that works in demos, then realize they can’t answer the only question that matters: Is this delivering value?

The Teams That Win Do It Backwards

The most successful AI implementations I’ve seen follow what I call the Reverse Strategy Framework. Instead of rushing to pick models and frameworks, they start by figuring out how to measure success, design systems to capture those measurements, and only then choose their tools.

Sounds simple, right? But it changes everything.

A Banking Chatbot That Almost Failed

Let me show you what this looks like in practice. A retail bank came to me with their customer support chatbot already planned out: GPT-4, LangChain, Pinecone. They just needed help launching it.

I asked three questions:

How will you know if the chatbot makes correct refund decisions?
When something breaks, how will you debug it?
What does each customer interaction actually cost?

They had no answers. These questions don’t surface when you start with tools—they only emerge when you start with measurement.

The Framework That Actually Works

Layer 1: Define Success Metrics

Before touching any code, we defined what “working” meant across four categories:

AI Performance: 85% accuracy on fee refunds, sub-30-second responses
Business Results: 60% deflection rate, 40% faster resolution times
Safety: Zero customer data leaks, complete audit trails
Cost: Under 50 cents per query (vs. $5 for human support)

This immediately revealed three massive problems with their original plan. They had no test cases, no way to trace AI decisions, and no handling for customers with multiple accounts. All invisible when picking tools, obvious when defining measurements.
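One way to keep targets like these honest is to write them down as data and check observations against them. The sketch below is illustrative only: the dictionary keys and the `check_targets` helper are hypothetical names, and each threshold is treated as an inclusive bound for simplicity.

```python
# Targets taken from the four categories above. The key names are
# hypothetical, and bounds are treated as inclusive for simplicity.
TARGETS = {
    "refund_accuracy": ("min", 0.85),
    "response_seconds": ("max", 30.0),
    "deflection_rate": ("min", 0.60),
    "data_leaks": ("max", 0),
    "cost_per_query_usd": ("max", 0.50),
}


def check_targets(observed: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or miss their target."""
    failures = []
    for name, (kind, target) in TARGETS.items():
        value = observed.get(name)
        if value is None:
            failures.append(name)  # an unmeasured metric counts as a failure
        elif kind == "min" and value < target:
            failures.append(name)
        elif kind == "max" and value > target:
            failures.append(name)
    return failures
```

Treating a missing measurement as a failure is a deliberate choice in this sketch. It surfaces the “we have no way to measure that” gaps at the same moment as the numeric misses.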

Layer 2: Build for Measurement

We designed two systems simultaneously:
- the data foundation the AI needs
AND
- the tracking system to see what it’s doing.

Every policy section got tagged with metadata. Every decision created a traceable log. We built 300 labeled test cases from historical data.

The payoff? When accuracy dropped in month two, we could see exactly why: the AI was calculating refund eligibility from the wrong timestamp. Fixed in hours, not weeks.
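A minimal sketch of the idea, assuming labeled cases are stored as simple records and the system under test is a function that returns a decision. `LabeledCase`, `evaluate`, and `toy_decide` are hypothetical names, and the refund rule in `toy_decide` is made up purely so the example runs.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabeledCase:
    case_id: str
    request: dict   # inputs the chatbot would see
    expected: str   # label derived from historical data


def evaluate(cases: list[LabeledCase],
             decide: Callable[[dict], str]) -> tuple[float, list[str]]:
    """Run every labeled case; return accuracy and one JSON trace per decision."""
    traces = []
    correct = 0
    for case in cases:
        actual = decide(case.request)
        ok = actual == case.expected
        correct += ok
        traces.append(json.dumps({
            "case_id": case.case_id,
            "request": case.request,
            "expected": case.expected,
            "actual": actual,
            "correct": ok,
        }, sort_keys=True))
    accuracy = correct / len(cases) if cases else 0.0
    return accuracy, traces


def toy_decide(request: dict) -> str:
    # Made-up stand-in for the real system: refund fees under 30 days old.
    return "refund" if request["days_since_fee"] < 30 else "deny"
```

Because every trace line carries the inputs alongside the expected and actual decision, a drop in accuracy can be narrowed to the specific cases and fields involved instead of being investigated from scratch.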

Layer 3: Choose Your Tools

Only now did we select models and frameworks. Our requirements made the choices obvious: Claude for transparent reasoning, RAG for policy flexibility and citations, proactive filters for safety checks.

The Dashboard That Changed Everything

Every Monday, leadership sees real-time metrics: queries handled, satisfaction scores, cost per query, quality issues. When accuracy dips or costs spike, alerts fire automatically.

Week six: Fee refund accuracy dropped to 81%. Alert triggered. Team traced it to a missing policy update. Fixed in 4 hours.

Week eleven: Customer satisfaction on transfers hit 3.8. Logs showed technically correct but confusing responses. Simplified templates. Back to 4.3 within days.

Your Choice: Design or Firefighting?

Yes, this is more upfront work. The question is: Would you rather spend two weeks designing measurements before you build, or six months firefighting a system you can’t see into after launch?

Would you rather walk into your CFO’s office with a dashboard showing 60% deflection and $0.50 per query, or say “We think it’s working, but we’re not sure how to measure it”?

If you can’t measure it, you can’t improve it. If you can’t prove it works, you can’t defend the investment.

The reverse strategy isn’t about being cautious - it’s about being successful. Start with measurement. Build for observability. Let requirements drive tools.

Your future self (and your CFO) will thank you.

Free Templates That Help You Get Started

I’ve created a few templates that can help you get started. Find the links in the description of the video. Download them, review them, and get in touch if you have questions.

Watch the full breakdown in the video, where I walk through all three layers with real examples and decision points.

Want to dive deeper into production AI systems? Subscribe to AgentBuild for weekly practical tutorials on building reliable AI agents.

The Business of Coding Agents: How Cursor Hit $1B ARR While Research Shows Developers Got 19% Slower

Sandipan Bhaumik — Sun, 30 Nov 2025 14:02:39 GMT

Hello everyone,

Coding agents have gone viral in enterprises this year. I can’t find a single customer of mine who isn’t using one. The scale and patterns of usage vary, but everyone is investing in a coding agent. A question I often get asked is about the ROI on what is popularly known as “vibe coding”.

I like to look at this from two different angles:

How do coding agent providers like Cursor and Replit make money?
How do enterprises make money using coding agents?

Let’s dive in and look into the business of coding agents.

Today I answer the following questions:

How do coding agent companies like Cursor and GitHub Copilot actually make money, and why are their margins terrible?
Does AI really make developers more productive - what does the actual research show?
What’s the real ROI of coding agents for enterprises, and why do individual gains fail to translate to company-level improvements?
Why is “vibe coding” potentially bigger than the entire current coding agent market?
How should enterprises actually measure success with coding agents instead of blindly rolling them out to everyone?

Let’s talk about the business of coding agents first. Cursor hit $1B ARR in just 24 months - the fastest any SaaS company has ever scaled. This is insane. Cognition AI (maker of Devin) is now valued at $10.2 billion. GitHub Copilot generates around $400M in annual recurring revenue. There’s real money being made here.

But here’s what’s interesting: the people making money from coding agents and the people getting value from them aren’t always the same organizations.

There’s a bigger shift happening that most people aren’t paying attention to - one that could make the current coding agent business models look quaint in comparison.

Let’s talk about the actual business models at play, what ROI really looks like when you strip away the demo magic, and where this market is actually headed.

How Coding Agent Companies Actually Make Money

The business model is beautiful in its simplicity: seat-based SaaS at developer tool pricing.

Cursor charges $20/month for pro and $40 for business subscriptions. Copilot ranges from $10-$19/month depending on tier. The math is straightforward - if you can get into an engineering org, you’re looking at recurring revenue that scales with headcount.

But here’s where it gets interesting. The actual cost structure for these companies is wild. Every completion costs them money - API calls to models like Claude or GPT-4 can range from $3-$15 per million tokens for mid-tier models, and that’s before accounting for the massive context windows these coding tools require.

According to Base 44’s founder Maor Shlomo (discussing the economics of “vibe coding” platforms), the margins in this business “suck” right now. But model prices are racing toward zero. Companies can switch LLM providers with a simple code change, creating insane market dynamics where hundreds of thousands of dollars in spend shift between Anthropic, OpenAI, and Google overnight based on which model offers the best cost-performance ratio.

This is why you see aggressive usage caps. Why context limits exist. Why some features are gated behind enterprise tiers. They’re managing margin by managing how much you can actually use the thing you’re paying for.

The margin play is intelligent routing - sending simple requests (“change this button color”) to smaller, cheaper open-source models while reserving the expensive frontier models for hard problems. The more requests the cheap models can handle, the better the margins become.
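The arithmetic behind that margin play can be sketched in a few lines. This is a simplified illustration that assumes every request uses the same number of tokens. The function name is hypothetical, and the $0.50 price for the cheap model is an assumed figure, while $15 is the top of the per-million-token range quoted above.

```python
def blended_cost_per_million(cheap_share: float,
                             cheap_price: float,
                             frontier_price: float) -> float:
    """Average price per million tokens when a share of traffic goes to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * frontier_price


if __name__ == "__main__":
    # Assumed prices: $0.50 (cheap model) and $15 (frontier) per million tokens.
    for share in (0.0, 0.5, 0.8):
        cost = blended_cost_per_million(share, 0.5, 15.0)
        print(f"{share:.0%} routed to cheap model: ${cost:.2f} per million tokens")
```

Under those assumed prices, routing 80% of traffic to the cheap model cuts the blended price from $15.00 to $3.40 per million tokens, which is why the routing layer carries so much of the margin.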

Devin’s playing a different game - they’re going after the outsourcing market. Instead of “pay us per developer seat,” it’s “pay us instead of contractors.” Cognition’s ARR grew from $1M in September 2024 to $73M in June 2025, suggesting someone finds that value prop compelling.

What ROI Actually Looks Like

I talk to CTOs all the time. The ones getting real ROI from coding agents aren’t seeing it where you’d think.

The obvious assumption: “My developers code 30% faster, so I save 30% on engineering costs.”

This is fantasy. Here’s why:

Engineering cost isn’t in typing speed. It’s in figuring out what to build, architectural decisions, debugging production issues, and all the stuff that happens around the code. If coding agents make the typing part faster, you’ve optimized maybe 15-20% of the actual work. And that’s best case.
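This is Amdahl’s law applied to an engineering team, and a quick calculation shows how small the overall effect is. The sketch below reads “30% faster” as 1.3x throughput on the coding portion only, which is one possible interpretation. The function names are hypothetical.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Amdahl's law: whole-job speedup when only the coding part is accelerated."""
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


def overall_time_saved(coding_share: float, coding_speedup: float) -> float:
    """Fraction of total engineering time saved."""
    return 1.0 - 1.0 / overall_speedup(coding_share, coding_speedup)


if __name__ == "__main__":
    # Coding is 20% of the work and gets 1.3x faster; everything else is unchanged.
    print(f"{overall_time_saved(0.20, 1.3):.1%} of total time saved")
```

With coding at 20% of the work, a 1.3x speedup on that portion saves roughly 4.6% of total time, nowhere near 30%. Even an infinitely fast coding step could not save more than the 20% it occupies.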

Here’s what the research actually shows: A randomized controlled trial by METR with 16 experienced developers found that when using AI tools like Cursor Pro with Claude, developers actually took 19% longer to complete tasks. Not faster. Slower.

Okay, before you panic - this was measuring experienced developers on complex, mature codebases. The story changes depending on context.

Industry reports show AI coding assistants increase individual developer output by 20-40% in vendor studies, but research from Faros AI analyzing over 10,000 developers found no significant correlation between AI adoption and company-level performance improvements. The gains at the individual level don’t translate to business outcomes.

Why? Bottlenecks. Teams with high AI adoption complete 21% more tasks but PR review time increases 91%. You’re just moving the constraint.

Where is the Real Money right now?

The companies actually making money with coding agents fall into a few buckets:

1. The Consulting Play

Consulting firms and agencies are printing money with coding agents. Why? Because they bill by project, not by hour, and their margins just went up.

If you’re a dev shop charging $150K for a project that used to take 300 hours, and coding agents help you finish it in 200 hours, you just made an extra $10K, assuming a cost of $100/hour. Do that across 50 projects and you’re talking real money.

But this only works if clients don’t know you’re using agents. The moment they know, they renegotiate pricing. So there’s this weird incentive to keep quiet about the tooling, which is unethical in many respects.

2. The Junior Dev Leverage

Some companies are hiring cheaper, less experienced developers and using coding agents as the experience multiplier. Instead of a team of $200K senior engineers, you’ve got $120K mid-level engineers with AI assists.

The math works if, and this is a big if, the quality doesn’t drop and the senior engineers aren’t spending all their time fixing agent-generated code.

I’ve seen this work in specific contexts: building internal tools, working on well-understood patterns, extending existing systems. I’ve seen it fail catastrophically when applied to greenfield architecture or complex system design.

3. The Maintenance Cost Reduction

This is where I see the clearest ROI. Using coding agents for the boring, necessary stuff that eats engineering time:

Writing tests for legacy code
Updating dependencies
Migrating APIs when external services deprecate
Generating documentation
Refactoring for style guide compliance
Standardising tutorials across the organisation.

Research shows that for common tasks like these, teams can see genuine time savings, though the exact magnitude varies wildly by use case.

The Cost Side that is Easy to Ignore

Here’s what the ROI calculators from vendors don’t include:

Time to productive use. It’s not plug-and-play. You need to train your team, establish patterns for what works, create guidelines for when to use agents vs not. I’ve seen companies spend 3-6 months getting to productive use. That’s cost.
Code review overhead. AI adoption is consistently associated with a 9% increase in bugs per developer and a 154% increase in average PR size. Agent-generated code needs review. Sometimes more careful review because it’s not from someone whose judgment you know.
Technical debt accumulation. Agents are great at generating code that works. They’re terrible at generating code that fits your architectural vision. If you’re not careful, you end up with a codebase that’s a franken-pattern of different styles and approaches.

The Enterprise Trap

Here’s where companies lose money: buying enterprise licenses for their entire engineering org because “everyone should have access to the best tools.”

You just committed to $20/month × 200 developers = $48K/year.
What’s the return?

Most companies can’t tell you. They’ll say “productivity gains” but they’re not measuring it. They don’t know which teams are using it effectively, which features provide value, or whether the ROI is positive.

This is how coding agent companies make money - selling to enterprises who don’t measure outcomes.

What Good Looks Like

The companies getting real ROI are treating coding agents like any other tool investment:

They start small. Ten seats for a specific team on a specific project. They measure actual outcomes - deployment frequency, bug rates, time to ship features, whatever matters for that project.
They track costs honestly. Not just subscription fees, but review time, quality issues, time spent on prompt engineering.
They calculate actual ROI. Did this $2K investment save us $10K in engineering time? Can we prove it?
Then they scale what works and kill what doesn’t.
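The calculation itself is simple enough to keep in a notebook. The sketch below is a minimal illustration with hypothetical function names. The $6K of review and quality overhead in the usage example is an assumed figure, included only to show how quickly hidden costs change the answer.

```python
def annual_seat_cost(price_per_month: float, seats: int) -> float:
    """Yearly subscription spend for a seat-based license."""
    return price_per_month * seats * 12


def roi(value_delivered: float, subscription: float, other_costs: float = 0.0) -> float:
    """Return on investment as a ratio: net gain divided by total cost."""
    total_cost = subscription + other_costs
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (value_delivered - total_cost) / total_cost


if __name__ == "__main__":
    print(annual_seat_cost(20, 200))               # 200 seats at $20/month
    print(roi(10_000, 2_000))                      # subscription cost only
    print(roi(10_000, 2_000, other_costs=6_000))   # with assumed overhead
```

Counting only the subscription, a $2K spend that saves $10K is a 4x return. Add the assumed $6K of review time and quality issues and the same pilot returns 0.25x, which is why tracking costs honestly comes before calculating ROI.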

The Bigger Game

Here’s where it gets really interesting. While everyone’s focused on whether Cursor or Copilot wins the developer productivity race, there’s a much bigger market emerging that could make the current business models obsolete.

The market is moving from “helping developers code faster” to “replacing entire categories of software.”

Think about it: right now, coding agents help developers build software. But what if the endgame isn’t faster development - it’s customized software on demand?

According to Maor Shlomo, this “vibe coding” category could become the largest in the software industry.

Here’s why:

Software is moving away from “one size fits all.” Instead of buying a CRM license from Salesforce, you take a template and use vibe coding to customize exactly what you need. Arabic right-to-left interface? Done. Custom lead pictures? Done. No feature bloat, no vendor lock-in, you own the code and data.

The prediction: within a few years, it might be easier to build a personalized Salesforce-type CRM than to buy an off-the-shelf license.

This doesn’t just compete with Cursor. This competes with Salesforce, Monday.com, and every SaaS company that’s essentially a front-end on a database.

The Real Moat in This Market

Anyone can build a tool that generates a simple website. That’s commoditized already.

The moat is in complexity. Building a platform that can help people create functional products for real-world use cases - sometimes involving millions of lines of code. Organizational tools. Functional platforms. Complex applications.

The other moat is vertical integration. When Base 44 builds a fully integrated platform with built-in backend, database, user management, authentication, integrations, and analytics all in-house, it becomes very hard to replicate and even harder for users to migrate away. Compare that to competitors using third-party providers like Supabase - much easier to switch.

But here’s the problem: competitive velocity is insane. Features that used to take years to copy now take weeks or months. This means you have to take big bets on things that are hard to copy while moving at maximum speed.

The Strategic Game Right Now

For coding agent companies, the current game isn’t about margins. It’s about growth and capturing what Shlomo calls an “insanely big market.”

As LLM prices race toward zero, the cost problem solves itself. Revenue per customer will naturally decrease as models get cheaper, but the market is so large that growth matters more than optimizing margins today.

The real competitive threat isn’t other coding agent companies. It’s a model provider winning big. If Google dominates with Gemini (they have the compute, the stack, the data, the integrations), their next move would be conquering the vibe coding market themselves.

The counter-strategy: grow fast, then introduce a proprietary model (like Cursor’s Composer) that lets you move users from expensive third-party models to your high-margin internal model.

How Organizations Should Actually Think About This

If you’re buying coding agents today:

Be honest about costs. Include everything - subscriptions, training time, review overhead, quality issues.
Measure ruthlessly. Pick specific metrics before you start. Track them honestly. In the METR study, developers predicted AI would make them 24% faster, but actual measurements showed a 19% slowdown. Don’t rely on feelings.
Start narrow. One team, one use case, one project. Prove ROI before scaling.
Kill what doesn’t work. Most companies keep paying for seats that aren’t providing value because they don’t want to admit the experiment failed.
Watch the bigger shift. If vibe coding takes off, you might not need as many developers at all. You might need product people who can specify what they want built. That’s a different skill set, different org structure, different budget allocation.

The Real Question for Your Business

Here’s what you should be asking: Are you in a business where customized software is your competitive advantage, or are you in a business where off-the-shelf SaaS is fine?

If you’re the former, vibe coding platforms might let you build exactly what you need without the engineering team you’d normally require. If you’re the latter, coding agents that make your existing team faster are probably the right play.

But if you’re a SaaS company selling software that’s fundamentally just a database with a UI layer? You should be worried. Because in a world where anyone can spin up a customized version of what you sell in an afternoon, your moat disappears.

The Bottom Line

Coding agents are a real business. Cursor became the fastest-growing SaaS company of all time, hitting $100M ARR in roughly 12 months. People are making real money - both selling them and using them.

But the current game (selling seats to make developers faster) might just be the opening act. The real business of coding agents might be replacing entire software categories by making custom software as easy to get as buying a SaaS subscription.

The vendors making money today are selling subscriptions at scale. The successful users are being selective and ruthless about ROI. The big winners tomorrow might be the ones who figure out how to let non-technical people build what they actually need instead of settling for what vendors decide to sell them.

Which side of that future are you building toward?

References for data I used in this article:

Sacra Research - Cursor revenue analysis
Bloomberg/TechCrunch - Cognition AI funding and valuation
METR - Randomized controlled trial on AI coding productivity
Faros AI - AI Productivity Paradox research report
Multiple industry reports on AI coding assistant ROI
Base 44 founder interview on vibe coding economics
