AI Is Making Decisions With Data You Can’t See
Three types of data your AI uses that you can't see, and why that's costing teams weeks of debugging: conversation history, version drift, and the promise-vs-delivery problem.
Hello lovely people,
Another field note from the trenches - the real, unpolished stuff I see teams wrestling with when they’re trying to ship AI that actually works in the wild… not just in demo-land (you already know this), haha!
Today’s note comes from a pattern I keep seeing across startups and mid-market companies building their first AI agents. Teams are so focused on getting their AI to give good answers that they forget to check how it arrived at those answers. They’re measuring outputs but ignoring the data trail that produced them.
And that’s where things get messy.
So let me share what I’m learning about the data your AI is using - the data you probably can’t see right now - and why making it visible is the difference between shipping something reliable and trustworthy, and shipping an expensive guess.
Today I answer the following questions that often come up in my meetings:
Why are our chatbot’s responses failing with real customers when they looked perfect in testing?
We’re getting customer complaints about wrong answers, but our logs just show the chatbot’s response - how do we debug what actually went wrong?
Our accuracy metrics dropped 15% overnight and we didn’t change anything. What happened?
How do we know if the AI agent is using current data or making decisions based on stale information?
Seriously, I have seen AI agents making decisions and taking actions based on data the business was never aware of. And that’s a problem.
Think about it: you’re about to ship an AI agent. It gives great answers in demos. Your stakeholders are impressed. Everyone’s ready to celebrate.
But pause for a second.
Do you actually know what data your agent is looking at when it makes decisions?
Not the training data. Not the prompt. I mean the real-time information it’s pulling when it tells your customer “Yes, your refund is approved” or “No, you’re not eligible.”
Because here’s the thing: if you can’t see the data your agent used, you can’t verify if its decision was correct. And if you can’t verify correctness, you’re not doing AI - you’re doing expensive guessing.
Let’s Talk About What Your AI Actually Does
Your AI agent isn’t magic. It’s a decision-making system.
When someone asks “Can I get a refund?” your agent:
Looks up their account
Checks your refund policy
Verifies eligibility criteria
Calls your backend systems
Generates an answer
Each step uses data. And the quality of that data determines whether your agent gives the right answer or confidently makes things up.
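To make that less abstract, here’s a minimal sketch in Python of what recording the data behind each step could look like. The account data, policy values, and field names are stand-ins, not a prescription - the point is that every step leaves a trace you can read later.

```python
import json
from datetime import datetime, timezone


def handle_refund_request(customer_id: str, question: str) -> dict:
    """Answer a refund question and record every piece of data used along the way."""
    trail = {
        "customer_id": customer_id,
        "question": question,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "steps": [],
    }

    # 1. Look up the account - record what was actually retrieved.
    account = {"id": customer_id, "status": "active"}      # stand-in for a real account lookup
    trail["steps"].append({"step": "lookup_account", "data": account})

    # 2. Load the refund policy - record which version was in force.
    policy = {"version": "2025-06-11", "window_days": 14}  # stand-in for your policy store
    trail["steps"].append({"step": "load_refund_policy", "policy_version": policy["version"]})

    # 3. Check eligibility - record the result, not just the final answer.
    eligible = account["status"] == "active"               # stand-in for real eligibility rules
    trail["steps"].append({"step": "check_eligibility", "eligible": eligible})

    # 4. Call the backend - record whether the reversal was actually confirmed.
    refund_submitted = eligible                            # stand-in for a real backend call
    trail["steps"].append({"step": "submit_refund", "backend_confirmed": refund_submitted})

    # 5. Generate the answer.
    answer = "Your refund is approved." if refund_submitted else "You're not eligible for a refund."
    trail["steps"].append({"step": "generate_answer", "answer": answer})

    print(json.dumps(trail, indent=2))                     # persist this wherever your logs live
    return {"answer": answer, "trail": trail}
```

Whatever shape you pick, the response and the trail travel together - so when someone asks “why did it say that?”, you have something to point at.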
Let’s look at an example:
Say you’re about to ship a chatbot that handles customer bank accounts.
Your banking chatbot says: “I’ve refunded your £35 purchase. You’ll see it in 2-3 days.”
Sounds good! Ship it!
But wait: which version of your refund policy did it apply? The one from last week, or the one compliance updated this morning? Did it actually submit the reversal to your backend, or just say it did? Was the customer legitimate, and did the agent validate that?
If you don’t know the answers, you don’t know if your AI works.
You just know it sounds convincing. And “sounds convincing” is not a success metric.
Three Flavors of Invisible Data You’re Probably Missing
Flavor 1: The Conversation So Far
Testing AI with single questions is like judging a surgeon by watching them make one incision. You’re missing the whole operation.
Real conversations have history:
“I need a refund”
“For which transaction?”
“The one from Tuesday”
“You have three transactions from Tuesday. Which one?”
“The coffee shop one, I already told you”
If you’re only evaluating that last response without seeing what came before, you have no idea if your agent:
Remembered what the user said
Contradicted itself
Applied consistent logic
Actually understood the context
❌ The mistake: Evaluating each response in isolation.
✅ The fix: Log the full conversation. Test multi-turn flows. Check if your agent’s memory works as well as its vocabulary.
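If you’re wondering where to start, here’s a minimal sketch - a hypothetical ConversationLog, not any particular library - of keeping the full turn history so both your agent and your evaluation see the whole exchange, not just the last message:

```python
from datetime import datetime, timezone


class ConversationLog:
    """Minimal sketch: keep the full turn history so evaluation sees the whole conversation."""

    def __init__(self, conversation_id: str):
        self.conversation_id = conversation_id
        self.turns: list[dict] = []

    def add_turn(self, role: str, text: str) -> None:
        # Every turn is recorded with who said it and when.
        self.turns.append({
            "role": role,  # "user" or "agent"
            "text": text,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def context_for_next_turn(self) -> list[dict]:
        # The agent (and your evaluation harness) should see everything said so far,
        # not just the latest message.
        return list(self.turns)


# Usage: log every exchange, then evaluate the whole flow, not the final reply in isolation.
log = ConversationLog("conv-123")
log.add_turn("user", "I need a refund")
log.add_turn("agent", "For which transaction?")
log.add_turn("user", "The coffee shop one from Tuesday")
print(len(log.context_for_next_turn()))  # 3 turns of context, not 1
```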
Flavor 2: Which Version of Reality Your Agent Is Using
Pop quiz: Your AI said a customer qualifies for a discount yesterday. Today it says they don’t. What happened?
Option A: The AI got worse
Option B: Your pricing rules changed
Option C: The customer’s account status changed
Option D: You have no idea because you’re not tracking versions
Most teams pick D.
When your refund policy changes, your eligibility rules update, or your pricing tiers shift, your “ground truth” moves. If you’re not versioning the data and rules your AI uses, you can’t tell if failures are bugs or correct responses to new information.
❌ The mistake: Assuming your evaluation dataset stays valid forever.
✅ The fix: Tag every decision with metadata. Which policy version? Which data snapshot? Which timestamp? Make your AI’s decisions reproducible, not just explainable.
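As one possible shape for that metadata, here’s a minimal sketch (hypothetical field names) of tagging a decision with the policy version, a hash of the data snapshot it saw, and a timestamp:

```python
import hashlib
import json
from datetime import datetime, timezone


def tag_decision(decision: str, policy: dict, data_snapshot: dict) -> dict:
    """Minimal sketch: attach version metadata so every decision is reproducible later."""
    snapshot_bytes = json.dumps(data_snapshot, sort_keys=True).encode()
    return {
        "decision": decision,
        "policy_version": policy["version"],                               # which rules were in force
        "data_snapshot_hash": hashlib.sha256(snapshot_bytes).hexdigest(),  # which data the AI saw
        "decided_at": datetime.now(timezone.utc).isoformat(),              # when it decided
    }


# Usage: before concluding "the AI got worse", compare policy_version and
# data_snapshot_hash between yesterday's decision and today's.
policy = {"version": "pricing-v7", "discount_threshold": 500}
customer = {"id": "cust-42", "lifetime_spend": 480}
print(tag_decision("no_discount", policy, customer))
```

If the hashes or versions differ between the two days, reality moved - that’s not a model regression, it’s a changed ground truth.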
Flavor 3: What Actually Happened vs. What Your AI Said Would Happen
Your AI says: “Your refund has been processed.”
Your backend system says: “Payment reversal failed - insufficient authorization.”
Who’s right?
This is the gap between AI outputs and actual outcomes. Your agent can promise things your systems can’t deliver. And if you’re only measuring what your AI said, not what your systems did, you’re grading it on fiction.
❌ The mistake: Trusting AI outputs without verifying backend results.
✅ The fix: Close the loop. Join agent responses with system logs. Track promises vs. deliveries. When they don’t match, that’s your signal to investigate.
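A simple way to close that loop, sketched below with hypothetical record shapes (assuming you can join on some shared request ID): compare what the agent claimed against what the backend recorded, and surface every mismatch.

```python
def reconcile(agent_claims: list[dict], backend_events: list[dict]) -> list[dict]:
    """Minimal sketch: join what the agent promised with what the backend actually did."""
    outcomes = {event["request_id"]: event["status"] for event in backend_events}
    mismatches = []
    for claim in agent_claims:
        actual = outcomes.get(claim["request_id"], "no_backend_record")
        if actual != claim["promised_status"]:
            mismatches.append({
                "request_id": claim["request_id"],
                "agent_said": claim["promised_status"],
                "backend_says": actual,
            })
    return mismatches


# Usage: the agent told the customer the refund went through; the backend disagrees.
claims = [{"request_id": "rf-001", "promised_status": "refund_processed"}]
events = [{"request_id": "rf-001", "status": "reversal_failed"}]
print(reconcile(claims, events))  # one mismatch -> your signal to investigate
```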
Why This Matters More Than You Think
You might be thinking: “Okay, we need better logging. Got it.”
But this isn’t about logging. It’s about having evidence instead of hope.
❌ Most teams ship AI like this:
Build the thing
Test it until it looks good
Ship it
Hope production looks like testing
Debug frantically when it doesn’t
✅ Teams who ship reliable AI do this:
Decide what success means (with metrics)
Build measurement into the system from day one
Verify the AI has access to correct, current data
Test against realistic scenarios, not cherry-picked examples
Ship with evidence that it works
Monitor what it’s actually doing in production
Fix issues before customers notice
The difference? The second group can prove their AI works. The first group just believes it does.
And belief is not a debugging strategy.
What You Actually Need to See
Before you ship, make sure you can answer these questions:
About the data:
What information did my AI retrieve?
Was it current or stale?
Which version of my rules/policies did it use?
What was the timestamp?
About the decision:
What logic did it follow?
What assumptions did it make?
What trade-offs did it consider?
Why this answer and not another?
About the outcome:
Did the promised action actually happen?
Did backend systems confirm it?
Were there errors I need to know about?
Does reality match what my AI said?
About the conversation:
Is my AI consistent across turns?
Does it remember what users told it?
Has it contradicted itself?
Is it applying the same logic every time?
If you can’t answer these, you’re not ready to ship. You’re ready to hope.
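If it helps, here’s one way - a minimal sketch, not a standard schema - to fold those four sets of questions into a single record per decision:

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Minimal sketch: one record per decision that can answer all four sets of questions.
    Field names are illustrative, not a standard schema."""
    # About the data
    retrieved_data: dict
    data_timestamp: str
    policy_version: str
    # About the decision
    reasoning_summary: str
    assumptions: list = field(default_factory=list)
    # About the outcome
    promised_action: str = ""
    backend_confirmation: str = "pending"  # updated once your systems report back
    # About the conversation
    conversation_id: str = ""
    turn_index: int = 0
```

If you can fill in every field of something like this for each decision your AI makes, you can answer the questions above. If you can’t, you’ve found your gap.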
The Uncomfortable Truth
Your AI is only as good as your ability to verify it’s doing what you think it’s doing.
You can have the fanciest model in the world. But if you can’t see what data it’s using, which rules it’s applying, and whether its outputs match reality, you’re building a black box that makes decisions you can’t defend.
And when something breaks - and it will - you’ll be stuck guessing instead of knowing.
The teams winning at production AI aren’t necessarily using better models. They’re using better measurement.
They know what their AI is doing because they designed systems that make the invisible visible.
They test with evidence, not intuition.
They ship with confidence, not hope.
And when things go wrong, they have the data to understand why and fix it fast.
Start With What You Can See
You don’t need to measure everything perfectly from day one. But you need to start making the important stuff visible.
Pick one thing your AI does that matters to your business. Then ask:
What data does it need to do this correctly?
How do I verify it’s using the right data?
What does success actually look like?
How will I know if it’s working?
Build measurement into your system. Version your truth. Track your outcomes.
Because the scariest thing about AI is that you won’t know it’s wrong until your customers tell you.
And by then, it’s too late.
What’s Coming Next Week
Now that you know what invisible data is and why it matters, the obvious question is: How do I actually build this?

Next week, I’m sharing the technical implementation guide - the actual code structures, tools, and architecture patterns teams are using to make invisible data visible.
I’ll show you:
How to structure trace logging for multi-turn conversations (with actual JSON schemas you can use)
How to implement policy version tagging so you can debug “the model got worse” problems
How to build outcome verification loops that catch promise/delivery mismatches before customers complain
How to set up multi-turn evaluation harnesses that actually test how your AI behaves in real conversations
Plus the pragmatic week-by-week implementation plan so you don’t try to boil the ocean on day one.
Because understanding the problem is step one. Building the solution is where the real work begins.
In the meantime: Pick one decision your AI makes that matters to your business. Write down what data it needs to make that decision correctly. Then ask yourself: can you verify it’s actually using that data?
If the answer is no, you’ve found your starting point.
I write about building AI systems that work in reality, not just in demos. Subscribe for frameworks that help you ship AI you can actually trust.
Evolving the Newsletter
This newsletter is more than updates - it is our shared notebook. I want it to reflect what you find most valuable: insights, playbooks, diagrams, or maybe even member spotlights.
👉 Drop me a note, comment, or share your suggestion anytime.
Your feedback will shape how this evolves.
Found this useful? Ask your friends to join.
We have so much planned for the community - can’t wait to share more soon.


