The 7-Step Playbook for Turning Any Business Process Agentic
Save this playbook. A framework to help you power your business processes with AI agents. Which step are most enterprise teams skipping? Step 2. And it's why their agents fail in production.
Something happens in almost every meeting I’m in these days.
Someone opens a slide, or just starts talking, and within two minutes we’re deep into a conversation about models. Vendors. Which LLM is better for this use case. Whether to go with one orchestration framework or another.
Features. Availability. Pricing tiers. Token limits.
And I sit there. Listening. Waiting.
Then I ask something like: “What does success look like for this?”
Or: “How will you know in six months if this worked?”
The room usually goes a bit quiet. Sometimes people look at each other. Sometimes someone gives a vague answer about “efficiency” or “reducing manual effort.” And then, almost without fail, the tools conversation resumes.
I’ve stopped being surprised by this. But I haven’t stopped being bothered by it. Because the tool conversation feels productive. It has energy. People have opinions. There’s something to debate. Meanwhile the question of what you’re actually trying to achieve - measured, specific, verifiable - just sits there unanswered.
And then teams wonder why their agentic systems don’t survive contact with production.
This is the playbook I wish more of those meetings started with.
First question: do you actually need multi-agent?
I ask this because nobody else does. The assumption in most rooms is that “agentic” means multiple agents.
Sometimes that’s right. If decisions in your process genuinely can’t coexist - different data access, different authority, different latency requirements - then splitting makes sense.
However, multi-agent systems are just distributed systems. And distributed systems are hard. When something breaks, you’re chasing a failure across boundaries, through handoffs, through tool calls you can’t always replay. That complexity doesn’t disappear.
Start single agent. Let the constraints of the actual process push you toward multi-agent if they need to.
Most of the time, the process doesn’t need it. The team just wanted to build it.
Where does the AI go, and where does the human stay?
The most common mistake I see: humans get put at the end. Final review. Rubber stamp before anything goes out. It feels like a safety net.
A human reviewing an output they didn’t generate, without the context that produced it, at the end of a chain they can’t fully see - that’s not oversight. That’s decoration.
AI belongs where errors are recoverable. Humans stay where they’re not.
A compliance violation, an irreversible action, something you’d find out about through an audit three weeks later - those stay human until the system has earned the right to handle them.
You don’t decide in a design session that the AI can handle something. You prove it. Slowly. With data.
The human doesn’t leave the loop because the architecture says so. They leave because the evaluation says it’s safe.
Does the whole process need to go agentic?
Almost certainly not. But the pressure to say yes is enormous right now.
The ROI case always gets built for the full process. End-to-end automation, scale infinitely. That’s how it gets approved. And then reality shows up - the data isn’t ready, the decisions aren’t defined clearly enough, and nobody can tell whether any of it is working.
What I’ve seen work: find the one or two decisions where human time is most expensive or delay is most painful. Start there. Leave the rest human for now.
The process you want to automate probably wasn’t that well-designed to begin with. AI will find every shortcut, every undocumented exception, every “we just know” that your team built into it over the years. Automating the whole thing at once means hitting all of that simultaneously.
Pick one decision. Define it properly. Prove it works. Then move.
The 7-step playbook
This is the Reverse Strategy Framework applied to process agentification.
The order matters.
Each step is a gate - if you can’t pass it, you’re not ready for the next one.
1: Map the decisions, not the steps.
Get the people who actually run the process in a room. Have them document the judgment calls. At every point where a human exercises discretion: what are they looking at, what makes them go one way versus another, what would a wrong call look like, and how quickly would you know?
Ask: if you gave two experienced people the same input, would they make the same call? If they regularly don’t - you don’t have a process you can automate.
You have a process you need to design first. You can’t build an agent to make a decision the organisation hasn’t agreed on.
2: Define what ‘good’ looks like for each decision.
For each decision node you’re considering, map the precision you need, the acceptable error rate, and the cost difference between a false positive and a false negative. Put actual numbers on each.
These aren’t metrics you figure out after you build. They’re the thing that tells you whether the system is working at all. Most AI projects fail because nobody defined what capable meant. Do it here, before anything else.
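One way to make this stick is to write each decision’s success criteria down as data, not prose, so nobody can quietly reinterpret “good” later. A minimal sketch in Python - the field names and the invoice example are illustrative assumptions, not a prescription:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionSpec:
    """Success criteria for one decision node, agreed before anything is built."""
    name: str
    min_precision: float         # fraction of the agent's calls that must be correct
    max_error_rate: float        # acceptable overall error rate
    cost_false_positive: float   # cost of wrongly saying yes (e.g. in dollars)
    cost_false_negative: float   # cost of wrongly saying no

    def expected_error_cost(self, fp_rate: float, fn_rate: float) -> float:
        """Expected cost per decision at the given error rates."""
        return fp_rate * self.cost_false_positive + fn_rate * self.cost_false_negative

# Hypothetical example: an invoice-approval decision (numbers made up for illustration).
invoice_approval = DecisionSpec(
    name="invoice_approval",
    min_precision=0.98,
    max_error_rate=0.02,
    cost_false_positive=500.0,  # paying an invoice that should have been rejected
    cost_false_negative=50.0,   # delaying a legitimate invoice
)
```

The asymmetry between the two error costs is the part teams most often skip, and it’s exactly what decides the human/AI split in Step 4.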
3: Check your data readiness, decision by decision.
Agents make inferences from data. For each decision node you want to automate, check whether the data exists, whether an agent can access it in real time, and whether it’s structured in a way the agent can reliably use.
Most enterprise processes run on data designed for humans. PDFs. Exports. Systems that require a login and three clicks. Context that only exists because someone’s been in the role long enough to know where to look.
Check five things per node: how accessible the data is, whether the schema is clean and consistent, whether there’s enough metadata for the agent to interpret what it’s looking at, how errors and edge cases are handled, and whether there’s any observability into what the data’s doing.
If a node’s data isn’t ready, build the data layer first. A good model on broken data is still broken.
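The five checks above can be turned into an explicit gate per node, so “ready / not ready” is a recorded decision rather than a feeling. A small sketch - how you actually score each check is up to your team:

```python
# The five per-node checks from the text, as an all-or-nothing gate.
READINESS_CHECKS = (
    "accessible",       # can an agent reach the data in real time?
    "clean_schema",     # is the structure consistent and machine-usable?
    "metadata",         # is there enough context to interpret what it sees?
    "error_handling",   # are errors and edge cases handled, not silently dropped?
    "observability",    # can you see what the data is doing?
)

def data_ready(node_checks: dict[str, bool]) -> bool:
    """A decision node is ready only if every check passes; missing checks fail."""
    return all(node_checks.get(check, False) for check in READINESS_CHECKS)
```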
4: Assign each decision - AI, Human, or Hybrid.
Using what you’ve defined in Steps 2 and 3: AI handles high-volume, well-defined decisions where errors are recoverable. Humans handle decisions where a wrong call is asymmetric and non-recoverable. Hybrid - AI proposes, human confirms - is for the middle ground, where you think it’s probably automatable but don’t yet have the data to prove it.
Write this down. It becomes the architecture contract. If someone later asks why the agent doesn’t handle a particular decision, the answer is already there.
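The assignment rule itself is simple enough to encode, which is one way to write the contract down unambiguously. A sketch of the Step 4 logic, assuming you can answer three yes/no questions per node:

```python
from enum import Enum

class Owner(Enum):
    AI = "ai"
    HUMAN = "human"
    HYBRID = "hybrid"  # AI proposes, human confirms

def assign_owner(well_defined: bool, recoverable: bool, proven_by_data: bool) -> Owner:
    """Step 4 as a rule: non-recoverable decisions stay human; well-defined,
    recoverable decisions the evaluation has proven go to AI; everything
    in between is hybrid until the data says otherwise."""
    if not recoverable:
        return Owner.HUMAN
    if well_defined and proven_by_data:
        return Owner.AI
    return Owner.HYBRID
```

Note that `proven_by_data` starts out false for every node - which is why hybrid is the default, and why the human only leaves the loop when the evaluation in Step 6 says it’s safe.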
5: Decide single vs. multi-agent.
You now have the decision map. Look at it. Are there nodes that genuinely require incompatible contexts - different data access, different authority, reasoning chains that need to be isolated from each other? Those are your split points. If not, stay single.
Many teams I’ve seen start with this conversation. It actually belongs here, in Step 5, with real information in front of you.
6: Build the evaluation before you build the agent.
I know this feels backwards. Build the thing first, then measure it - that’s the instinct. Don’t do it. Please.
Before you write a line of agent code, collect 200+ real examples from the process. Actual inputs, paired with what a good human would have decided on each one. Then define how you’ll score whether the agent’s call matches that standard.
This forces a useful confrontation: can you actually define “correct” before the AI has to? Sometimes you can’t. That’s valuable to discover now rather than six months into production.
Evaluation isn’t a final checkbox. It’s the architecture that keeps the whole thing alive.
If you can’t build a golden dataset for a decision node, that node isn’t ready. That’s not a failure. That’s the process working.
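The scoring itself doesn’t need to be sophisticated to be useful. A minimal sketch of evaluating against a golden dataset - real scoring may be fuzzier than exact match, but the shape is the same:

```python
def accuracy_against_golden(agent_calls: list[str], golden_calls: list[str]) -> float:
    """Fraction of agent decisions that match the human gold standard,
    where each agent call is paired with the golden call for the same input."""
    if len(agent_calls) != len(golden_calls):
        raise ValueError("each agent call must pair with exactly one golden call")
    if not golden_calls:
        raise ValueError("an empty golden dataset can't evaluate anything")
    matches = sum(a == g for a, g in zip(agent_calls, golden_calls))
    return matches / len(golden_calls)
```

The hard part isn’t this function - it’s assembling the 200+ labeled examples it takes as input. That’s the work Step 6 forces you to do up front.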
7: Shadow mode first. Then cut over.
Run the agent on live inputs in parallel with the humans. Humans keep making the real decisions. You compare outputs - systematically, against your golden dataset and against the human calls on the same inputs.
The edge cases that didn’t show up in your test set will show up here. They always do. Shadow mode is where you find them safely, without consequences, while building the evidence base that earns trust.
Cut over when the error rate threshold is met. Not when the launch date arrives.
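The cut-over test can be stated precisely: on the same live inputs, the agent must both clear your threshold from Step 2 and match the humans it would replace. A sketch, assuming exact-match scoring against the golden calls:

```python
def ready_to_cut_over(agent_calls: list[str], human_calls: list[str],
                      golden_calls: list[str], min_accuracy: float) -> bool:
    """Cut over only when the agent, measured on the same shadow-mode inputs
    the humans handled, clears the agreed accuracy threshold AND performs
    at least as well as the humans did."""
    n = len(golden_calls)
    if not (n and len(agent_calls) == len(human_calls) == n):
        raise ValueError("agent, human, and golden calls must align one-to-one")
    agent_acc = sum(a == g for a, g in zip(agent_calls, golden_calls)) / n
    human_acc = sum(h == g for h, g in zip(human_calls, golden_calls)) / n
    return agent_acc >= min_accuracy and agent_acc >= human_acc
```

The launch date appears nowhere in this function. That’s the point.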
The question that tells you if you’re ready
Before any of this starts, ask yourself one thing:
If you ran the human process and the agentic process side by side on the same inputs for 90 days, would you have the data to prove the agent is performing at least as well?
If yes - you’ve defined success, you have evaluation infrastructure, you’re ready.
If no - something foundational is missing. You haven’t defined success clearly enough, the data can’t support measurement, or you don’t have a golden dataset to compare against. That’s Evaluation Debt.
The process you want to turn agentic probably has a version that can work. Whether you get there depends on whether you’re willing to answer the hard questions before the tools conversation starts.
Most meetings I’m in, we never get there.
If you’re in the middle of this - mapping a process, arguing about scope, trying to figure out where the human stays in the loop - hit reply. Tell me where it’s stuck. I find these problems genuinely interesting and I’m happy to share my insights.
If you enjoyed reading this, please share it with your friends. Leave feedback, and tell me what you’d like to read more of.
Thanks,
Sandi.
Ask your friends to join.
More valuable content coming your way.
Thanks for reading agentbuild.ai! Subscribe for free to receive new posts and support my work.
