From an organizational perspective, the case for multimodal AI agents in CX is fairly open and shut.
Multimodal AI enables richer, visual, voice-enabled interactions that lead to better outcomes, reduce customer effort, and build the kind of trust that text-only AI consistently struggles to earn.
So why are some organizations still seemingly wary of adopting the technology?
For Shan Lilja, Co-Founder of Mavenoid, it’s more about familiarity than capability:
“Companies worry about how customers will react to multimodal AI agents. But for customers, it’s a far more natural experience.”
While he concedes that this can be a little daunting for companies, he explains that the full multimodal leap doesn’t have to happen all at once.
Multimodal adoption is a journey with distinct phases, each one delivering real value independently, and the first step is considerably more accessible than most organizations assume.
Start Where the Difference Is Obvious
The instinct when approaching a new capability is often to either go all in or hold back entirely. Lilja’s advice is to do neither.
Instead, he encourages organizations to “design a set of steps so that each step is a no-brainer.”
That means beginning with support scenarios where multimodal is not marginally better than text but, according to Lilja, delivers “10x better” moments.
Strong use cases for multimodal include:
- Troubleshooting a complex physical product
- Onboarding a customer through a multi-step hardware installation with their hands occupied
- Processing a warranty claim where sharing images or video requires far less effort than a verbal description
These are the scenarios where the case for multimodal is so self-evident that adoption becomes much easier to drive for both the brand and the customer.
The Four Phases
From there, Lilja maps out a progression that takes organizations from their current state to the frontier of what’s now possible.
Phase one is channel switching, which he refers to as “pre-multimodal.” Voice automation detects that a visual element would help and sends the customer a text link with an image or guide.
The two modalities aren’t unified, but they’re connected. It’s a meaningful first step that improves the customer experience.
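As a rough illustration, phase one can be as simple as a branch in the voice flow: if the current step has a visual aid, text the customer a link. The sketch below is hypothetical; the types and the `sendSms` helper are stand-ins, not Mavenoid’s implementation.

```typescript
// Phase-one "channel switching": the call stays voice-only, but when a
// step has a visual aid, the bot texts a link. Names are illustrative.

interface TroubleshootingStep {
  spokenPrompt: string;
  visualGuideUrl?: string; // present when a diagram or video would help
}

// Stand-in for a call to any SMS gateway.
async function sendSms(phone: string, body: string): Promise<void> {
  console.log(`SMS to ${phone}: ${body}`);
}

async function deliverStep(
  step: TroubleshootingStep,
  callerPhone: string,
): Promise<string> {
  if (step.visualGuideUrl) {
    // The two channels stay separate; they are only *connected* by the link.
    await sendSms(callerPhone, `Here’s a visual guide: ${step.visualGuideUrl}`);
    return `${step.spokenPrompt} I’ve also texted you a picture guide.`;
  }
  return step.spokenPrompt;
}
```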
Phase two brings those modalities together. Voice, text, and visual inputs all flow into a single session, allowing the customer to talk, type, and share images without switching context.
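Under the hood, a unified session might look something like the sketch below. The types are illustrative assumptions rather than any vendor’s actual schema; the point is that every modality writes into one shared timeline that the AI reasons over.

```typescript
// Phase-two unified session: voice, text, and images land in one timeline,
// so no context is lost when the customer switches modality.

type ModalityEvent =
  | { kind: "voice"; transcript: string; at: Date }
  | { kind: "text"; message: string; at: Date }
  | { kind: "image"; url: string; at: Date };

class SupportSession {
  private timeline: ModalityEvent[] = [];

  add(event: ModalityEvent): void {
    this.timeline.push(event); // every modality feeds the same context
  }

  // One merged history is what the AI reasons over, regardless of channel.
  contextForModel(): string {
    return this.timeline
      .map((e) =>
        e.kind === "image"
          ? `[customer shared image: ${e.url}]`
          : e.kind === "voice"
            ? `(spoken) ${e.transcript}`
            : e.message,
      )
      .join("\n");
  }
}
```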
“Friction is much lower,” says Lilja. This is where the Broan-NuTone results were largely delivered: call drop rate fell from 25% to 7%, speed of answer improved fourfold, and service level score almost doubled from 43% to 80%.
As Don Lyskawa, Product Support Manager at Broan-NuTone, put it:
“The combination of digital self-service and voice automation has significantly reduced pressure on our team and improved the overall experience of our customers.”
Phase three is what Lilja calls being “hardcore multimodal” – a unified communication experience where customers can interact through any combination of pictures, voice, and text, with the AI orchestrating across all of them in real time.
This is the phase that tackles the most complex support interactions: simultaneous voice guidance while visual instructions update on-screen, with device telemetry feeding context into the AI as the customer works.
For products that are genuinely difficult to support in any other way, this is where resolution rates climb most steeply.
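One way to picture that orchestration, with every name below assumed purely for illustration: the customer’s last utterance and the device’s telemetry feed a single function that decides both what the voice channel says next and which visual step renders on-screen.

```typescript
// Phase-three orchestration sketch: voice guidance, on-screen visuals, and
// device telemetry all update one shared decision in real time.

interface DeviceTelemetry {
  errorCode?: string;
}

interface GuidanceUpdate {
  speak: string;      // what the voice channel says next
  showStepId: string; // which visual instruction card to render
}

function orchestrate(
  lastUtterance: string,
  telemetry: DeviceTelemetry,
): GuidanceUpdate {
  // Telemetry can short-circuit the dialogue: if the device already
  // reports a fault, jump straight to the matching visual step.
  if (telemetry.errorCode) {
    return {
      speak: `Your unit is reporting error ${telemetry.errorCode}. Let’s fix that first.`,
      showStepId: `error-${telemetry.errorCode}`,
    };
  }
  return {
    speak: `Got it, you said: “${lastUtterance}”. Try the step now on screen.`,
    showStepId: "next-step",
  };
}
```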
Phase four is live AI video, which is currently in beta testing with customers.
In Lilja’s framing, the customer’s phone becomes a “portal” into their home or workspace. The AI can see the environment, identify the problem, and guide the fix in real time.
“I have 70% confidence that it’s within six months before enterprises adopt this,” Lilja says.
The privacy considerations are real, but he argues the friction reduction for customers is substantial enough that adoption will accelerate quickly.
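For the technically curious, a hypothetical sketch of the loop such a system might run; `captureFrame`, `describeFrame`, and `speak` are placeholder stand-ins, not a real API.

```typescript
// Phase-four live AI video sketch: the phone streams frames, a vision model
// inspects each one, and the agent speaks guidance back in real time.

async function captureFrame(): Promise<Uint8Array> {
  return new Uint8Array(); // placeholder for a camera frame
}

async function describeFrame(frame: Uint8Array): Promise<string> {
  return "mounting bracket visible, top screw missing"; // placeholder vision call
}

async function speak(guidance: string): Promise<void> {
  console.log(`AI says: ${guidance}`);
}

async function liveVideoLoop(resolved: () => boolean): Promise<void> {
  while (!resolved()) {
    const frame = await captureFrame();
    const observation = await describeFrame(frame);
    // The model "sees" the workspace and narrates the next action.
    await speak(`I can see: ${observation}. Insert the screw into the top hole.`);
  }
}
```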
The Quadrant Nobody’s Automating
One of the more counterintuitive arguments for multimodal concerns where the efficiency gains actually come from.
The conventional wisdom around support automation focuses on high-volume, low-complexity interactions – quick queries that AI can handle cheaply and fast. But Lilja points to a different opportunity: the interactions that are routine but genuinely time-consuming.
For example, consider the 30- or 40-minute troubleshooting sessions that follow predictable patterns but require enough back-and-forth that they’ve stayed largely with human agents. He says,
“Automating the repetitive but very complex cases frees up a disproportionate amount of time.”
Multimodal makes that automation possible in a way that text-based AI can’t, because these interactions require real-time feedback, visual confirmation, and the kind of guided step-by-step support that only works when both parties can see what’s happening.
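A back-of-the-envelope calculation shows why the claim holds; the volumes below are made up for illustration, not figures from the article.

```typescript
// Illustration of the "disproportionate time" claim with hypothetical volumes.

const quickQueries = { perDay: 500, minutesEach: 3 }; // high-volume, low-complexity
const longSessions = { perDay: 60, minutesEach: 35 }; // routine but time-consuming

const quickMinutes = quickQueries.perDay * quickQueries.minutesEach; // 1,500 min/day
const longMinutes = longSessions.perDay * longSessions.minutesEach;  // 2,100 min/day

console.log({ quickMinutes, longMinutes });
```

Even at a fraction of the volume, the long sessions consume more total agent time, and that is exactly the quadrant multimodal opens up.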
What Happens to Agents
The natural question that follows any discussion of automation is what it means for the people currently doing that work.
Lilja argues that freeing agents from routine, time-consuming customer requests doesn’t eliminate their role; it concentrates it.
In doing so, it allows agents to focus on complex judgment calls, high-stakes customer relationships, situations that require emotional intelligence, and the kind of contextual reading that AI still can’t replicate reliably.
“The agent becomes more of a specialist,” he says. “They’re handling the calls that genuinely need a human – not because the AI couldn’t find the answer, but because the situation requires creativity, judgment, or empathy.”
The implication is that the agents who remain in the loop should be handling the interactions where human involvement actually makes a measurable difference, rather than spending 40 minutes walking someone through a process that a well-designed video guide could handle just as well.
For CX leaders mapping out where to go next, that’s probably the most useful thing to take away: you don’t need to be at phase four to see real returns. Each step pays for itself.