Do you think AI has actually made customer support better?
For businesses, the answer is mostly yes. Costs have come down, coverage has expanded, and chatbots handle queries that would have once meant a 20-minute wait for a human agent.
But what about for the customer? That’s where things get a little blurrier.
In many instances, customers now have to parse a wall of generated text, judge whether what they’re reading is accurate, rephrase their question when the AI misses the point, and verify whatever instructions they’ve been given before they act on them.
This means that for many customer service interactions, the customer has to put in more work, not less.
This point was raised by Shahan Lilja, Co-Founder of Mavenoid, in a recent discussion with CX Today:
“Company effort has gone down, but customer effort has risen.”
This gap between what organizations save and what customers spend is at the crux of AI-assisted support right now – and Lilja argues that multimodal AI is the bridge across it.
The Hidden Tax
The concept of customer effort isn’t new in CX. The Customer Effort Score has been a staple of support measurement for over a decade. But the way that effort manifests has changed with the introduction of large language models.
The old failure mode was friction from poor routing or slow response. The new one is cognitive load: the mental work required to interact productively with an AI that sounds confident but isn’t always right.
Lilja describes this as a “hidden tax on every AI interaction, paid by customers.”
These taxing moments will be familiar to anyone who has recently used a support chatbot.
While these effort-sapping instances don’t surface in dashboard metrics such as containment rates or CSAT scores, they add up quickly.
And this increased cognitive load can lead to abandoned sessions, repeat contacts, and a slow erosion of confidence.
Within this effort tax, Lilja points to a specific area that he believes to be one of the most damaging: “AI slop.”
“AI slop is low-quality, generic content produced by AI.”
This slop has the potential to cause serious damage to an organization’s customer service operations.
An AI that tells a customer to press the wrong button can damage their product or, in a worst-case scenario, create a safety issue.
And the liability implications are real. Earlier this year, Woolworths was forced to make adjustments to its AI chatbot after it falsely claimed to have an “angry mother” and presented itself as having personal family experiences.
Other examples include Air Canada being ordered to pay compensation after its chatbot gave incorrect refund information, and a customer convincing DPD’s chatbot to swear and write a poem about what a “terrible” company DPD is.
These aren’t arguments against AI in support, but rather, they’re arguments for AI that’s grounded in something more than language.
Why Text Keeps Failing
The structural problem with text-only AI is that language is, as Lilja puts it, “a tree of possibilities.”
Every sentence carries a range of interpretations, and the AI picks one. When that choice is even slightly wrong, the support interaction goes off course, and the customer bears the cost of correcting it.
Images and visual context work differently. They constrain interpretation. A photograph of a cracked product component, a real-time view of an error light, a video showing exactly which cable is loose – these don’t give the AI a tree of possibilities; they give it a scene, as Lilja explains:
“It’s harder to b******t a human with a false image than with false words.”
“Reality is more constrained. That’s how you bring hallucination rates down – by grounding things in more types of information.”
This is what Lilja refers to as “visual grounding,” and it’s one of six properties that define effective multimodal support:
- Enhanced context
- Reduced ambiguity
- Cross-modal consistency
- State awareness
- Real-time feedback
- Visual grounding
Together, they address the failure modes of text-only AI at a structural level, rather than just patching over them with better prompts.
The Feedback Loop Problem
One failure mode in particular stands out: delayed feedback.
In text-based support, a customer can follow a set of instructions for 10 or 15 minutes before discovering that step three was wrong.
By that point, they’ve potentially made things worse and are starting over from scratch, except now they’re frustrated and trust the AI’s next answer less.
Real-time visual feedback can address this issue. A video guide that shows a customer cleaning a washing machine’s drain filter can flag immediately if they’re doing it incorrectly. A live visual check on a hardware installation can catch a misconnected cable before the customer powers the device back on and damages it.
It may be an old saying, but Lilja’s remark that “a picture is worth a thousand words” is a surprisingly succinct argument for multimodal, particularly when the alternative is a hallucinated instruction in a customer support chat.
The brands that recognize this will be able to move beyond chasing incremental NPS gains to building support that customers can actually rely on – where the AI’s confidence is backed by something real.