Every major industrial revolution produced a corresponding shift in how companies support their customers.
Steam power brought remote, asynchronous correspondence: a letter to the manufacturer, a reply by post. Electricity gave us the telephone and instant, synchronous communication. Computing and the internet created digital, omnichannel support at scale.
Each era unlocked something that wasn’t possible before. And each one, eventually, hit a ceiling.
We are now, according to Shan Lilja, Co-Founder of Mavenoid, in the fourth, and potentially last, revolution:
“We have entered the last support revolution. Your phone is now a portal to an alien mind that can pass for human.”
It’s a striking claim. But spend any time with the argument behind it, and it becomes harder to dismiss.
Why This Time Is Different
The previous three revolutions were about reach and speed. Post made remote support possible, telephony made it immediate, and the internet made it scalable and always available.
What none of them solved was understanding. A support agent reading a letter still couldn’t see the problem; a telephone operator still had to rely on whatever the customer could describe in words; and a chatbot, however sophisticated, still operates primarily in language.
When language is all you have, a significant portion of what the customer is actually experiencing gets lost in translation.
Multimodal AI changes the dynamic by expanding the information available to the system.
By combining vision, audio, and telemetry from connected devices, the AI moves beyond interpreting a customer’s description of a problem to actually seeing and hearing the problem itself.
“You can spray unlimited intelligence on the situation,” says Lilja. “The next constraint becomes what the system can perceive. Now that constraint is coming down.”
The implications of that are broader than most support leaders have yet fully reckoned with.
The Three False Trade-offs
For much of the past decade, design decisions in AI-powered support have been framed as either/or choices.
Either you use visual guidance, or you use voice; you don’t do both in a single session. Either you use generative AI for flexibility, or you use curated, deterministic workflows for accuracy; you pick a side. Either you automate, or you keep humans in the loop.
Lilja argues that multimodal makes all three of these trade-offs unnecessary, and that they were always a function of technical limitation rather than genuine strategic choice.
Great AI support, in his framing, rejects these binaries: voice and visual together, generative breadth with structured grounding, human direction with AI doing the driving.
Across Mavenoid’s customer base, engagement with multimodal support sits consistently above 85% – a figure that reflects not just novelty, but genuine usefulness.
Customers who can show rather than describe, and receive visual confirmation rather than written instruction, stay in the session and see it through. That engagement is what makes everything else downstream (resolution rates, repeat contact reduction, parts conversion) actually move.
From Reactive to Proactive
The most significant implication of the current shift is what it makes possible in terms of proactive support.
The historical model has always been the same: customer encounters a problem, customer initiates contact, brand responds. The customer is always one step ahead of the support system, because the system can only respond to what it’s told.
Connected hardware changes that equation. When a product can transmit its own status data, the AI doesn’t have to wait to be told there’s a problem; it already knows.
Lilja points to Husqvarna as an example of what this looks like in practice. IoT alarm codes flag issues before the customer has even noticed them. The customer receives a notification, clicks through, and reaches a resolution path immediately – without the escalating frustration of a problem that’s been getting worse while they tried to figure out what was wrong.
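The alert-to-resolution flow described above can be sketched in a few lines. This is a minimal illustration, not Mavenoid’s or Husqvarna’s actual system; the code table, names, and routing logic here are all hypothetical:

```python
# Hypothetical sketch of proactive IoT support: a connected device
# reports a status code, and the system maps it to a resolution path
# before the customer has to describe anything.

# Illustrative alarm codes mapped to guided resolution paths.
ALARM_CODES = {
    "E042": {"issue": "blade motor overload", "guide": "/guides/blade-motor"},
    "E017": {"issue": "battery cell imbalance", "guide": "/guides/battery"},
}

def route_alert(device_id: str, code: str) -> dict:
    """Turn a device-reported alarm code into a proactive notification."""
    match = ALARM_CODES.get(code)
    if match is None:
        # Unknown code: fall back to a reactive, human-triaged ticket.
        return {"device": device_id, "action": "open_ticket", "code": code}
    return {
        "device": device_id,
        "action": "notify_customer",
        "issue": match["issue"],
        "resolution_link": match["guide"],
    }

# The system knows about the problem before the customer does:
alert = route_alert("mower-123", "E042")
print(alert["action"], alert["resolution_link"])
```

The key design point is the inversion of initiative: the device’s telemetry, not the customer’s complaint, starts the session, and the customer’s first touchpoint is already a resolution path.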
When support is fast, accurate, and proactive, it stops being something customers dread and starts being something that actually builds loyalty.
The product relationship, after all, doesn’t end at the point of sale. For many hardware brands, support is where it’s really tested.
The Cost of Standing Still
For CX leaders still working primarily with text-based support, the pressure from this shift is increasingly hard to ignore.
Consumer AI has reset customer expectations: tools like ChatGPT and Gemini interact multimodally, and customers now treat that as a baseline, not a premium feature.
A customer who photographs a car engine to ask what’s wrong with it isn’t going to be satisfied with a support chatbot that asks them to describe the problem in writing. As Lilja puts it:
“If you’re stuck with a wall of text, users will abandon the session.”
The downstream consequences – lower self-service rates, higher human support costs, brand damage – won’t appear on any “multimodal readiness” dashboard. But they do show up, quietly, in metrics that seem unrelated to the channel decision.
The brands building multimodal infrastructure now aren’t just fixing today’s support experience; they’re accumulating an advantage that compounds over time – better data, better AI training, better customer relationships – while their competitors are still deciding whether to start.
Lilja’s own read on the moment is fairly blunt: “End of history narratives are always false. But it sure feels like we’re approaching the end of history for support as we know it.”
This is not because innovation stops here, but because for the first time, the technology can meet the full complexity of what support actually requires.