Why AI Agents Must Be Proven Before They Are Deployed

AI agents can transform customer experience, but without rigorous testing, simulation, and observability, they risk undermining the trust enterprises depend on.

Why the AI Agent Rush Breaks Trust Faster Than Leaders Expect

Puent put it plainly: “Poor customer experience equates to losing trust.” And the business cost of lost trust is quantifiable: According to Harvard Business Review’s 2025 Customer Retention Technology report, acquiring a new customer costs 5 to 25 times as much as retaining an existing one. Every customer interaction that erodes confidence has a downstream price tag that rarely shows up on the deployment dashboard.

He went on to argue that many organizations are still validating AI agents in ways that do not reflect production reality.

Manual testing can help early on, but it doesn’t scale to the number of scenarios an AI agent will face once it’s handling live customer interactions. “People testing AI agents today is largely a non-automated function,” Puent said. “They’ve got a human being sitting down, interacting with it, subjectively saying… it was good or bad.”

The bigger gap is realism. Desk-based roleplay can’t recreate the messiness of real calls – noise, interruptions, stress, mumbling, changing context mid-sentence. AI-powered simulation can replay and vary thousands of real-world interactions so teams can test broader coverage and tougher conditions without exposing live customers to failure.

“You’ve got to find ways to minimize the risk and test more scenarios, complete with confidence,” Puent said.

In other words, the rush is not just about speed. It is about the mismatch between how AI agents behave and how many organizations still evaluate readiness.

Testing that doesn’t reflect the full range of real customer scenarios isn’t testing; it’s wishful thinking. And wishful thinking is how you discover your AI agent’s worst moments live, in front of customers who will remember.

AI Agents Are an Operational Change, Not a Technical Launch

Puent warned that when organizations treat deployment as a purely technical exercise, they may check the engineering boxes while missing the customer experience outcomes that actually matter.

“You might be missing the feedback loop… the things that lead to the customer experience,” he said. “Even if it passed all of the tests, but it’s a miserable experience for the customer, you’re doing the wrong thing.”

Shen focused on what changes inside the customer experience operation when AI agents become decision-makers between the IVR and a human advisor. In traditional flows, the customer chooses to reach a person. With an AI agent, the system makes decisions that shape the conversation before a human ever joins. “There’s an AI agent in the middle, making individual decisions,” Shen said.

“On the human side, they need to know what decisions that AI agent made and have… observability… or else… they might be doing the duplicate things or might be confusing the customer.”

This is where many deployments stumble. Without visibility into what the AI decided and why, escalations become slower, more confusing, and significantly more costly. When your team has to piece together a broken AI interaction, handle times spike. As Forrester’s January 2025 report, Improving CX Can Drive More Than $1B In Revenue, highlights, these points of operational friction actively drain millions from the bottom line.
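One way to picture that visibility is a running record of what the AI agent decided and why, handed to the human advisor at escalation. The Python sketch below is illustrative only: the `Decision` and `HandoffContext` names, their fields, and the sample values are assumptions made for this example, not part of any product API.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    step: str    # what the AI agent did
    reason: str  # why it did it


@dataclass
class HandoffContext:
    """Hypothetical record that travels with the customer to a human advisor."""
    contact_id: str
    decisions: list[Decision] = field(default_factory=list)

    def record(self, step: str, reason: str) -> None:
        """Append one decision to the trail."""
        self.decisions.append(Decision(step, reason))

    def summary(self) -> list[str]:
        """Numbered, human-readable trail of the agent's decisions."""
        return [
            f"{i}. {d.step} (reason: {d.reason})"
            for i, d in enumerate(self.decisions, start=1)
        ]


# Illustrative values only.
context = HandoffContext(contact_id="example-001")
context.record("verified identity", "customer asked about a specific order")
context.record("offered replacement", "item reported damaged")
context.record("escalated to advisor", "customer asked for a refund instead")

for line in context.summary():
    print(line)
```

With a trail like this in front of them, an advisor can see that identity is already verified and a replacement was already offered, which avoids the duplicate steps Shen describes.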

Systems designed to avoid human escalation will optimize for containment. Systems designed for intelligent and flexible AI (agentic and deterministic with a human handoff when needed) will optimize for resolution. The architectural choice determines which outcome your customers experience — and which one they remember.

The Cost Of Getting It Wrong Is a Trust Bill You Pay Later

Shen tied the risk to Amazon’s ‘earn trust’ leadership principle and connected it to a business reality CX leaders know well.

“Trust is hard to earn and easy to lose. Once the trust is lost, it is more expensive to earn back.”

He also pointed out that retaining existing customers is cheaper than winning new ones, which means failures that erode loyalty can become a compounding cost.

That compounding effect is more severe than most deployment teams expect: one bad AI interaction can drive churn from a customer who wasn’t planning to leave.

Lost trust is costly, and often impossible, to win back with the same customer. Retention is always cheaper than acquisition, so deployment isn’t just a product decision; it’s a retention decision.

For Shen, the answer is not fear. It is discipline. Enterprises need confidence that AI agents can handle real scenarios before customers are exposed to them.

That requires testing beyond the happy path.

“You got to let AI agent… behave in those edge cases,” Shen said. “My team and I, we’re building those tools at Amazon Connect Customer to let you validate at scale before you deploy.”

Why Pre-Production Has To Be a Phased Discipline

Shen argued that skipping structured pre-production stages is a dangerous shortcut because it shifts uncertainty into the environment where mistakes are most visible and most expensive. Forrester’s 2025 Total Experience Score reinforces this, showing that brands suffering public CX failures face steep financial penalties in immediate customer churn and long-term recovery costs.

Pre-production testing isn’t just technical validation; it’s the basis of customer trust. Every scenario you don’t test before launch is one your customers will test for you, with their real experience at stake.

“When you’re testing in production, it’s just too risky,” he said. “You’re exposing all the issues you haven’t found out before… you really don’t want the real customers to test your product.”

He described a phased approach, moving through environments such as dev, beta, gamma, and pre-production. Each stage exists to catch issues early, automate repeatable validation, and make rollback possible if changes introduce problems.
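The phased approach can be sketched as a simple promotion gate: a build advances only while every check at the current stage passes. This is a minimal Python sketch under stated assumptions. The stage names come from Shen's description, while the `promote` function and the placeholder checks are hypothetical.

```python
from typing import Callable, Optional

# Stage names taken from the phased approach described above.
STAGES = ["dev", "beta", "gamma", "pre-production"]

Check = Callable[[], bool]


def promote(checks_by_stage: dict[str, list[Check]]) -> tuple[list[str], Optional[str]]:
    """Walk the stages in order and stop at the first one that fails.

    Returns the stages that passed and the stage that blocked promotion
    (None if every stage passed). A stage with no checks defined is
    treated as a blocker, so nothing is promoted unvalidated.
    """
    passed: list[str] = []
    for stage in STAGES:
        checks = checks_by_stage.get(stage, [])
        if not checks or not all(check() for check in checks):
            return passed, stage
        passed.append(stage)
    return passed, None


# Placeholder checks: real ones would run automated validation suites.
all_green = {stage: [lambda: True] for stage in STAGES}
gamma_red = {**all_green, "gamma": [lambda: True, lambda: False]}

print(promote(all_green))
print(promote(gamma_red))
```

Because the gate returns the last stage that passed, it also identifies the point to roll back to when a change introduces problems.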

This approach allows you to slow down and then speed up. Resolving issues before launch means going live quickly and smoothly, and adding more use cases afterward goes just as quickly and smoothly.

For AI agents, the requirements are fundamentally different from those of deterministic automation, because agent behavior changes with context and language.

“AI agents, they’re not static,” Shen said. “They respond in context… and they’re non-deterministic.”

In other words, they’re probabilistic: they adapt to tone, phrasing, and context in real time, which is what makes them feel conversational. But it also means you won’t always get the exact same answer twice, which introduces a class of risk that traditional automation doesn’t.
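A practical consequence is that a single passing run proves little: the same input has to be tried many times and the spread of outcomes measured. The Python sketch below illustrates the idea with a toy stand-in. The `toy_agent` function, its 90/10 weighting, and the intent labels are assumptions for illustration, not a model of any real agent.

```python
import random
from collections import Counter


def toy_agent(utterance: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic AI agent.

    It usually picks the right intent but occasionally drifts,
    which is the behavior a single test run would miss.
    """
    return rng.choices(["refund", "billing_question"], weights=[0.9, 0.1])[0]


def consistency(agent, utterance: str, runs: int, rng: random.Random):
    """Return the most common outcome and the share of runs that produced it."""
    outcomes = Counter(agent(utterance, rng) for _ in range(runs))
    top_outcome, count = outcomes.most_common(1)[0]
    return top_outcome, count / runs


rng = random.Random(7)  # seeded so the experiment is repeatable
top, share = consistency(toy_agent, "I was charged twice", runs=200, rng=rng)
print(f"most common outcome: {top}, share of runs: {share:.0%}")
```

A share well below 100% on a scenario that matters is a signal to investigate before launch, even if the first manual test looked fine.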

That’s why “slowing down to speed up” is a strategy, not caution.

Shen also challenged the idea that skipping stages “saves time.” “It just moves the issue and the problems from testing to production,” he said.

Simulation Is How You Find The Gaps Customers Will Trigger

One step Shen believes is skipped too often is simulation, particularly at scale. For AI agents, simulation is not only about testing responses. It is also about testing the quality of the knowledge base the system depends on.

“If you put garbage in, you get garbage out.”

He explained that customers ask the same question in many different ways. They bring different moods, different assumptions, and different language patterns. That’s why effective simulation needs to be grounded in real interactions – not desk-written scripts.

With Amazon Connect Customer, teams can replay thousands of historical customer conversations against an AI agent before it goes live, testing the full range of how people actually speak, escalate, get confused, and change their minds. Done at that scale, simulation can surface ambiguity in documentation and expose scenarios where the agent may fail or hallucinate, before any real customer is affected.

Each gap discovered in simulation is a piece of customer trust you just protected. Each gap discovered in production is a piece of customer trust you just lost.

“With large-scale simulation, you’re able to test your documentation, your knowledge base thoroughly,” Shen said. “Customers… ask the same question in many different ways.”

They also ask it from everywhere – at home, in a noisy coffee shop, in the car with kids in the backseat – often distracted, stressed, or mid-task. That real-world variability is hard to recreate in desk-based testing, and it’s where large-scale simulation earns its value.
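The replay idea can be sketched in a few lines: run each historical utterance through the agent and collect the cases where its answer differs from how the interaction was actually resolved. This is a minimal Python sketch, not the Amazon Connect implementation. The `HistoricalTurn` type, the `keyword_agent` stand-in, and the sample utterances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HistoricalTurn:
    utterance: str        # what the customer actually said
    expected_intent: str  # how the interaction was actually resolved


def keyword_agent(utterance: str) -> str:
    """Hypothetical agent under test: a naive keyword matcher."""
    text = utterance.lower()
    if "refund" in text or "money back" in text:
        return "refund"
    if "cancel" in text:
        return "cancel"
    return "unknown"


def replay(agent, history: list[HistoricalTurn]) -> list[tuple[str, str, str]]:
    """Replay every utterance; return (utterance, expected, got) for each miss."""
    gaps = []
    for turn in history:
        got = agent(turn.utterance)
        if got != turn.expected_intent:
            gaps.append((turn.utterance, turn.expected_intent, got))
    return gaps


# Illustrative history: the same request phrased three different ways.
history = [
    HistoricalTurn("I want a refund", "refund"),
    HistoricalTurn("can I get my money back??", "refund"),
    HistoricalTurn("this charge is wrong, undo it", "refund"),
    HistoricalTurn("please cancel my plan", "cancel"),
]

gaps = replay(keyword_agent, history)
for utterance, expected, got in gaps:
    print(f"GAP: {utterance!r} expected {expected}, got {got}")
```

Here the third phrasing slips past the agent. A gap like that is exactly what desk-written scripts tend to miss and what replaying real conversations is meant to surface.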

Puent connected that to a broader shift: expectations change continuously, so the system needs continuous feedback and refinement. “What customers will tolerate today… is different than it was… even six months ago.”

That framing matters because it positions testing and simulation as ongoing operational capability, not pre-launch housekeeping. As customer expectations outpace deployment cycles, organizations that build continuous simulation into AI operations will outperform those that treat testing as a one-time gate. It’s the operational habit that separates trusted AI programs from fragile ones.

What Responsible Go-Live Looks Like: Start Small, Measure, Scale

So how do teams know when it is “safe” to progress? Puent said it will always be a business decision, but it should be made with evidence.

The key questions are not abstract. They are operational: Have you identified edge cases? Have you measured outcomes? Are customers getting stuck? Is customer effort going down, or are you just shifting work elsewhere?

He then highlighted a practical technique inside Amazon Connect Customer that supports controlled experimentation: gradual rollout using traffic percentages.

“You can take one percent of your traffic and point it to a different flow,” Puent said. Start with 1% of live traffic. Measure. If it works, increase to 5%, then 10%, then 50%. If it doesn’t? Roll it back instantly. The next inbound interaction never touches the failed experience.

This isn’t A/B testing a webpage. It’s controlled experimentation on live customer conversations with zero-risk rollback – built natively into the platform.

“Everybody hates it. They’re getting stuck in loops… Great. Roll it back,” Puent said. “Next inbound call is not at risk of going down there.”

For enterprise teams, that combination of controlled exposure, measurement, and rollback is what turns AI agent rollout into a repeatable operating model, rather than a high-stakes bet. It also transforms each deployment stage from a binary launch into a series of evidence-based trust decisions, building organizational confidence alongside customer confidence, incrementally and measurably.
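The start-small, measure, scale loop can be sketched as two small functions: one that splits traffic by percentage and one that decides whether to advance or roll back. This Python sketch is an illustration under assumptions. The stage percentages follow the text, while the function names, the success-rate metric, and the threshold are hypothetical and do not represent the platform's API.

```python
import random

# Percent of live traffic at each stage, following the stages in the text.
STAGES = [1, 5, 10, 50]


def route(percent_to_new: int, rng: random.Random) -> str:
    """Send roughly `percent_to_new` percent of interactions to the new flow."""
    return "new_flow" if rng.random() * 100 < percent_to_new else "current_flow"


def next_percent(current: int, success_rate: float, threshold: float) -> int:
    """Advance one stage if the measured success rate clears the threshold.

    Otherwise roll back to 0 percent, so the next inbound interaction
    goes to the current flow. The threshold is a business decision.
    """
    if success_rate < threshold:
        return 0
    later_stages = [stage for stage in STAGES if stage > current]
    return later_stages[0] if later_stages else current


rng = random.Random(0)
sample = [route(1, rng) for _ in range(10_000)]
share = sample.count("new_flow") / len(sample)
print(f"share routed to new flow at the 1% stage: {share:.2%}")

print(next_percent(1, success_rate=0.95, threshold=0.90))   # advances a stage
print(next_percent(10, success_rate=0.60, threshold=0.90))  # rolls back
```

Keeping the advance-or-roll-back rule explicit is what makes each step an evidence-based decision rather than a judgment made under pressure.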

The Testing Shift: Human-In-The-Loop Still Matters, But It Cannot Scale Alone

Shen and Puent also pointed to a broader trend: enterprises are rethinking testing as AI agents become more autonomous and cover more use cases.

“While human-in-the-loop is key to building confidence and earning the trust of the business… it has long-term limitations to scalability as AI agents evolve to handle an increasing number of use cases and scenarios that require frequent updates.”

They argued that moving toward autonomous testing is no longer optional. As AI agents expand, the volume of scenarios requiring validation will easily outpace a human team’s capacity. Scaling successfully requires a platform where simulation and continuous validation are native capabilities. Otherwise, human-in-the-loop testing becomes a bottleneck that slows every new use case.

“Integrating testing into both the build and deployment processes is key to successful deployment.”

Observability Is The Difference Between Guessing and Improving

Once systems are live, observability becomes the engine for continuous improvement. Puent emphasized that backend data is what allows teams to make decisions based on reality instead of assumptions.

“Having data on the backend allows for data-driven decisions, enabling current interactions to inform future experiences,” he said. “Customer expectations, language, and needs will continue to evolve. You need data to keep your finger on the pulse…”

This is where AI agents become either a strategic advantage or a reputational risk. Without observability, leaders can only react after damage is done. With it, teams can detect failure patterns, refine knowledge, tune flows, and prove readiness for the next stage of rollout.

Observability is not a feature; it is the operating requirement for deploying AI agents at scale with confidence. Without it, trust in your AI program is a guess. With it, trust becomes a metric you can manage, improve, and prove to every stakeholder who needs to sign off on the next expansion.

One Piece of Advice Before You Approve AI Agents For Production

When asked what leaders should do before approving an AI agent for production, both guests focused on momentum paired with rigor.

Puent’s advice was about starting and iterating:

“Start now, action wins every time, start small, iterate, learn, improve, and keep driving forward.”

Shen’s advice was about grounding validation in real customer data and treating testing as continuous:

“Test thoroughly using real data, not assumptions. Identify possible scenarios from your actual data and create tests based on that. Testing isn’t a one-time process; continuously update your tests as you scale and expand your AI agent experiences…”

Together, the message is clear: move fast, but only within a framework designed to protect trust.

The Bottom Line: Trust Is The KPI That Outlives the Pilot

AI agents are quickly becoming the front door to the enterprise, and that changes what “ready” means. In customer experience, a failure is not just a bug. It is a moment the customer remembers. 43% of customers are now willing to switch providers after a single poor service experience, a number that has climbed for three consecutive years. When tied to Forrester’s data showing CX improvements can drive over $1B in revenue, it becomes clear that a single bad AI handoff isn’t just a technical glitch; it’s a direct hit to profitability.

Shen and Puent are not arguing for hesitation. They are arguing for readiness: phased rollouts, large-scale simulation, testing that keeps up with change, and observability that enables improvement.

The Business Case You Bring to Your Leadership Team

If you’re a CX leader reading this and nodding — but wondering how to get your C-Suite to fund the discipline — here’s the framing that works: the cost of not proving AI agents before deployment isn’t a technical debt. It’s a trust debt that compounds quarterly. Every unvalidated interaction is a retention risk. Every retention risk has a dollar value. And the cost of winning back a customer who left because your AI failed them is 5-25x what it would have cost to keep them. This isn’t a technology investment. It’s a retention insurance policy with measurable ROI.
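To put a number on that framing, the 5 to 25 times multiplier can be applied to your own retention figures. The Python sketch below is a back-of-the-envelope calculation only: the function name and the sample inputs (100 customers, 40 per customer) are hypothetical and should be replaced with real data.

```python
def replacement_cost_range(customers_lost: int, retention_cost_per_customer: float):
    """Estimate the cost of replacing lost customers.

    Applies the 5x to 25x acquisition-versus-retention multiplier
    cited earlier in the article to a per-customer retention cost.
    """
    base = customers_lost * retention_cost_per_customer
    return base * 5, base * 25


# Hypothetical inputs for illustration only.
low, high = replacement_cost_range(customers_lost=100, retention_cost_per_customer=40)
print(f"retention would have cost: {100 * 40}")
print(f"replacement costs between {low} and {high}")
```

Even with modest inputs, the gap between the retention figure and the replacement range is the number that tends to land with a finance audience.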


Tony Shen Jeremy Puent