Last week, OpenAI released the latest iteration of its flagship large language model (LLM): GPT-4o.
The LLM stands apart in its multimodality, with the ability to reason in real-time across audio, vision, and text.
For years, building AI that understands multiple modalities has proved challenging. Just creating pipelines for tasks such as speech-to-text is tricky due to issues like high processing time.
Now, GPT-4o can do that almost instantly.
Until now, AI platform providers have invested heavily in such multimodal projects, spending that Rowan Trollope, CEO of Redis, now suggests is obsolete.
Taking to X (formerly Twitter), the former Five9 CEO stated:
“Hundreds of millions of dollars of R&D in Contact Center AI Agents just became obsolete with OpenAI GPT-4o developments. The attention should be on back-end automation.”
It would not be the first time ChatGPT has made R&D into contact center innovation obsolete, either.
Just consider how earlier releases of the LLM have undone the work of many analytics providers, which spent hundreds of hours engineering natural language processing (NLP) models to gauge intent, sentiment, and more. An LLM can extract all this out of the box.
Another example is agent-assist innovation. While businesses once spent significant R&D resources building use cases like isolating key data points within a customer conversation, ChatGPT and other LLMs can do so instantaneously.
Indeed, that is ultimately why many businesses look to first implement LLMs in the contact center. The use cases were already there, but they are now far more accessible.
Consider the following graphic from an October 2023 Gartner study. It showcases how customer service is a prominent recipient of enterprise GenAI investment.
Expect this to continue, and – with the introduction of multimodal LLMs – the use cases will become more innovative. Real-time translation is an excellent example.
Real-Time Translation and Other Multimodal Use Cases
In recent years, conversational AI vendors have brought various real-time translation models to market, with brands like Cognigy even making them available on the voice channel.
Typically, these apps will first use speech-to-text to develop a transcript from the customer’s audio.
That transcript then feeds through a translation engine – like Google Translate – and the agent receives a text translation within their workspace.
From there, the agent types out their reply, which – via the engine – translates back to the original language and plays out through a text-to-speech audio stream.
The primary issue with these experiences is the dead air between the customer speaking and the agent typing back a response. It’s a rapport killer.
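The hops described above, and the delay they add up to, can be sketched in a few lines of Python. This is a rough illustration, not any vendor's implementation: the stage functions are hypothetical stand-ins for real speech and translation services, the "audio" is just a tagged string, and the latency figures are made up to show how the waiting time accumulates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]
    latency_s: float  # made-up figure, for illustration only


# Hypothetical stand-ins for real speech and translation services.
def speech_to_text(audio: str) -> str:
    # Pretend "audio" is a string tagged with "audio:" and strip the tag.
    prefix = "audio:"
    return audio[len(prefix):] if audio.startswith(prefix) else audio


# Toy phrase table standing in for a translation engine.
PHRASES = {"hola": "hello", "hello": "hola"}


def translate(text: str) -> str:
    return PHRASES.get(text, text)


def text_to_speech(text: str) -> str:
    return "audio:" + text


def run_pipeline(stages: List[Stage], payload: str) -> Tuple[str, float]:
    """Pass the payload through each stage in order and sum the delays."""
    total = 0.0
    for stage in stages:
        payload = stage.run(payload)
        total += stage.latency_s
    return payload, total


# Customer speaks -> transcript -> translated text in the agent workspace.
inbound = [
    Stage("speech-to-text", speech_to_text, 1.0),
    Stage("translate", translate, 0.5),
]

# Agent types a reply -> translated back -> played as audio.
outbound = [
    Stage("translate", translate, 0.5),
    Stage("text-to-speech", text_to_speech, 1.0),
]

if __name__ == "__main__":
    text, delay_in = run_pipeline(inbound, "audio:hola")
    reply, delay_out = run_pipeline(outbound, "hello")
    print(text, reply, delay_in + delay_out)
```

Every stage in the chain adds its own wait, and that is before counting the time the agent spends typing. A single model that handles audio in and audio out would, in principle, collapse these hops into one.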
Thankfully, with “off-the-shelf” real-time translation, a multimodal LLM may cut to the chase.
Just take a look at this example that OpenAI released of GPT-4o, translating a live conversation between native English and Spanish speakers.
Alongside translation, consider how AI can adjust the agent’s accent to one that’s much more familiar to the customer, ensuring full comprehension.
Krisp already offers this use case. Yet, with GPT-4o, it may become much more widely available. After all, one of OpenAI’s demos showed a GPT-powered voice changing styles on the fly.
As a final example, think about how GPT-4o could transform customer-virtual agent conversations.
For instance, consider how many leading conversational AI vendors have augmented their solutions with image recognition (IR) to recognize entities within photos and make automated recommendations. A multimodal LLM provides this capability out-of-the-box.
Adding that capability to a virtual agent could bring many new use cases to life across various sectors, including retail, utilities, and the public sector.
Take a local council as an example of the latter. If a person tweets them with a photo of a faulty piece of street furniture and shares the location, GPT-4o could identify if it belongs to the council.
With that verification, the LLM could trigger an automated, personalized response and prompt a pre-planned workflow to resolve such issues.
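As a rough sketch of that flow, assume a multimodal model has already labeled the photo, and that the council keeps a register of the assets it owns. Both the register and the workflow names below are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Report:
    photo_label: str  # label a vision model is assumed to have returned
    location: str     # location shared by the resident


# Hypothetical register of council-owned assets, keyed by (type, location).
COUNCIL_ASSETS = {
    ("bench", "High Street"),
    ("street light", "Mill Lane"),
}


def triage(report: Report) -> str:
    """Return the name of the workflow to trigger for a reported fault."""
    if (report.photo_label, report.location) in COUNCIL_ASSETS:
        # Verified as council property: start the pre-planned repair flow.
        return "repair_workflow:" + report.photo_label
    # Not in the register: point the resident to the right owner instead.
    return "refer_elsewhere"


if __name__ == "__main__":
    print(triage(Report("bench", "High Street")))
    print(triage(Report("bench", "Elm Road")))
```

The model's part of this is the labeling step. The ownership check and the routing are ordinary back-end logic, which is where the orchestration work sits.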
Ensuring proper enterprise orchestration of such use cases is likely the next battleground – as Trollope suggested – especially given the sophistication of these out-of-the-box features.
GPT-4o: The Broader Enterprise Story
The launch of GPT-4o offers more insight into the future of customer interactions.
For example, the following demo of two GPT-4os interacting and singing perhaps sheds some light on a future when machine customers and agents converse on behalf of their human counterparts.
However, it’s also fascinating to consider how OpenAI chose to present all these demos through the smartphone, with GPT-4o flicking between modalities – almost becoming an extension of the senses.
That suggests it’s moving into the mobile market to further expand the spread of generative AI, which is perhaps unsurprising, given recent reports that OpenAI is in talks with Apple over a deeper integration of its technology in iOS.
Moreover, the smartphone examples highlight the impact that multimodal LLMs could have on consumers’ daily workflows.
Yet, perhaps most pertinently, by opening up multimodal capabilities to a full audience of users, OpenAI takes us closer to real-time, cheaper AI in the enterprise.
For example, those in finance may utilize this model in workflows where they compare documents, find errors, and send emails.
Previously, they had to document these procedures step-by-step, script them, and create a slow, inflexible process flow. Now, AI can dynamically adapt and automate these workflows, improving efficiency significantly.
Elsewhere, consider a large consumer packaged goods (CPG) organization that uses planograms to manage product placement.
Conventionally, auditing these placements involved taking pictures and analyzing them manually. Now, with GPT-4o, the company can analyze video footage in real-time, overcoming previous limitations like poor lighting or space constraints.
These are just two examples of many that highlight how GPT-4o can automate complex workflows and enhance real-time interactions throughout the enterprise.
However, as Trollope implied, its success depends on developing back-end integrations, alongside customizing the model to specific needs and ensuring accurate, context-aware responses.