Think about the last time you called a company’s support line about a broken appliance or a device that wouldn’t turn on.
Chances are you found yourself trying to decipher a painfully long serial number printed in a tiny, eye-squint-inducing font.
Or maybe you had to describe what a flashing light looked like, or explain – in words – exactly which part of the product had cracked.
I don’t think I’m going out on too much of a limb here to say that it’s a frustrating experience. And according to Shan Lilja, Co-Founder of Mavenoid, it’s also a completely avoidable one:
“When you’re supporting physical products, voice alone is like standing next to someone but with your eyes closed. You can’t see what they’re doing, you can’t point to anything, and they can’t show you what they’re talking about.”
I’m sure Shan’s words will resonate with many customers. This is the core problem multimodal support is designed to solve.
It isn’t a bells-and-whistles upgrade to existing channels; it’s more of a fundamental rethink of how resolution actually happens when the product in question exists in the physical world.
Chat Got Us Far, But Not Far Enough
The past decade of customer support innovation has been largely text-driven.
From chatbots to knowledge bases to AI-assisted ticketing, the default assumption has been that if you can get a customer to type their problem, you can get them to a solution.
For most queries, that works fine. Interactions such as account issues, billing questions, and order status updates can usually be solved through the power of the written word.
But for physical products, the gap between what a customer can describe and what a support agent needs to know is significant.
A customer troubleshooting a home ventilation system or setting up a new piece of hardware is often trying to follow fairly complicated instructions while their hands are occupied, sometimes in an unfamiliar environment, and frequently under stress.
Text creates friction at every step of that process. And friction in support compounds quickly.
The Case for Visual + Voice
Multimodal support combines voice, text, and visual elements (images, diagrams, video guidance) within a single connected experience.
The customer doesn’t have to switch channels, restart the conversation, or hunt for a YouTube tutorial. Everything they need is in the same place, delivered in the format that actually fits what they’re trying to do.
The practical impact of that can be considerable, as Lilja explains:
“We see 4x fewer abandoned calls with multimodal compared to text-only AI agents.”
“Take warranty claims as an example: rather than asking a customer to describe their issue over the phone, they can just take a picture.
“It’s a much smoother process – much less effort for the end user – and you can automate much more of it as a result.”
Part of what drives that engagement is the removal of the small but relentless frictions that erode text-based support.
Things that might sound minor – serial numbers misheard over the phone, product instructions that are hard to follow, or having to put the product down to look something up on another device – can have a major impact on the overall customer experience.
“If someone says ‘serial number SN800Z’ over the phone and it’s misheard, you end up repeating yourself,” Lilja explains. “Instead, it might be more appropriate to ask the user to enter the serial number via keyboard or take a photo of the label and detect it that way.
“Voice might not always be the best modality when accuracy is important.”
This is particularly impactful when you consider just how widespread and frustrating customer repetition is.
Zendesk’s 2026 CX Trends report revealed that 74% of consumers find it frustrating to have to repeat information to agents. Moreover, a separate report from Glance found that only 7% of customers say they rarely or never have to repeat themselves.
This means that almost three-quarters of customers find repetition frustrating, yet 93% run into the issue on a regular basis.
If your organization is able to introduce a system or solution that can largely eliminate this problem, the potential upside could be significant.
Resolution, Not Just Response
There’s a tendency in the industry to treat multimodal as a UX preference – something that makes the experience nicer, but not necessarily better.
Lilja pushes back on that framing directly, arguing that for complex products especially, the choice of channel can actually determine whether the interaction ends in resolution, or just in the appearance of one.
“When we’re dealing with physical products, atoms instead of just bits, the stakes are higher,” he explains.
“If we give the wrong instructions, we could cause harm to the user or damage the device. For us, multimodal isn’t optional; it’s essential.”
The business case follows from that. Brands that get resolution right the first time have happier customers, fewer callbacks, fewer incorrect part orders, and fewer unwarranted returns.
The metrics that matter to operations leaders and finance teams are downstream of whether the support interaction actually solves the problem.
Keeping the Starting Point Simple
One practical concern that comes up consistently when CX leaders consider multimodal is implementation complexity.
The assumption is that visual + voice support means a major technology overhaul, encompassing months of integration work, significant IT involvement, and an uncertain return.
Lilja’s view is that the starting point doesn’t have to be that dramatic:
“The main approach is to do it iteratively: design a set of steps so that each step is a no-brainer.
“Don’t add all modalities at once. Start where the gain is obvious, like seeing what you’re trying to troubleshoot.”
Lilja encourages organizations to identify the support scenarios where multimodal would make the biggest difference (the interactions where text genuinely can’t get the job done) and start there. Build the case internally, then expand.
For many brands, that means beginning with a hybrid of voice automation and visual guidance, without necessarily attempting a full platform rebuild from day one.
In a world where physical products are getting more connected and more complex, support that can only see through words is working with a significant handicap.
You can learn more about Mavenoid’s multimodal approach by visiting the website.