OpenAI’s Latest Moves Put Many Voice AI Startups on Notice

Among other moves, the AI juggernaut has made it simpler to build AI agents for voice-over-phone scenarios, like customer support

Published: September 1, 2025

Charlie Mitchell

OpenAI has released its “most advanced speech-to-speech model yet”: gpt-realtime.

The AI giant also took the wraps off its Realtime API, now generally available with new capabilities as it moves out of beta.

OpenAI hopes enterprises and developers will leverage both the model and the API to build “production-ready voice agents”.

Some of the API’s latest features will help here. For instance, the Realtime API’s new support for image inputs and remote MCP servers will make these agents more capable.

Yet there is also a notable new capability for building better customer support voice AI agents.

As Peter Bakkum, Member of Technical Staff at OpenAI, said in the announcement video:

We’ve added support for SIP telephony, which makes it much easier to build applications for voice-over-phone situations like customer support.

With this, a developer could easily grab a phone number from Twilio, feed that into the SIP interface provided by OpenAI, add prompts, feed it data, and let it go.
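That wiring can be sketched in code. The snippet below shows the Twilio side of the flow: generating the TwiML that bridges an inbound call on a Twilio number to a SIP endpoint. The SIP URI is a placeholder for illustration only; the real address would come from the SIP interface OpenAI provisions for a given project.

```python
# Minimal sketch: point a Twilio phone number at a SIP endpoint.
# Twilio executes TwiML when a call arrives; <Dial><Sip> bridges
# the caller to the given SIP URI.
from xml.sax.saxutils import escape

def sip_forwarding_twiml(sip_uri: str) -> str:
    """Build TwiML that forwards an inbound call to a SIP endpoint."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Dial>"
        f"<Sip>{escape(sip_uri)}</Sip>"
        "</Dial></Response>"
    )

# Placeholder URI, not a real OpenAI address:
print(sip_forwarding_twiml("sip:my-project@sip.example.com"))
```

From there, the prompting and data feeds live entirely on the OpenAI side, which is precisely why the telephony layer looks so thin.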

As Andreas Granig, CEO at Sipfront, observed in a LinkedIn post, that is quite the threat to many conversational AI startups.

“There are quite some startups, who only provide an interface to the public phone network for existing speech-to-speech AI services, often without much telco moat, but relying mostly on Twilio and others… They are in hot water now,” noted Granig.

The CEO acknowledged that startups specializing in tool calling for advanced integrations remain safe, since that remains a specialist field. However, he added:

The voice interface for AI assistants just became [a] commodity.

As a result, it will be more difficult to differentiate use cases for AI assistants, signalling to many conversational AI startups that now is the time to step up.

What About the New gpt-realtime Model?

OpenAI hopes many customer support teams will leverage gpt-realtime, alongside the Realtime API, as they advance their customer support automation strategies.

Indeed, as Bakkum said in the same announcement video:

We carefully aligned the model… to real scenarios like customer support and academic tutoring.

There are many reasons why support leaders would consider the gpt-realtime model. For starters, it enables AI agents that can understand and produce audio without relying on separate transcription, language, and voice models.

Additionally, there are performance benefits. Because a single model handles the entire exchange, these agents respond faster, and they can capture subtleties like laughter or sighs while expressing a range of emotions.
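The architectural contrast behind that latency claim can be sketched as follows. The functions here are trivial stand-ins, not real SDK calls: in production, each stage of the chained pipeline would be a separate model invocation, each adding its own round-trip.

```python
# Stand-in stages; real systems would call model APIs at each step.
def stt(audio: bytes) -> str:            # speech-to-text
    return audio.decode()

def llm(text: str) -> str:               # language model
    return text.upper()

def tts(text: str) -> bytes:             # text-to-speech
    return text.encode()

def speech_to_speech(audio: bytes) -> bytes:
    return audio.decode().upper().encode()

def chained_turn(audio_in: bytes) -> bytes:
    """Classic pipeline: three sequential calls, three round-trips."""
    return tts(llm(stt(audio_in)))

def realtime_turn(audio_in: bytes) -> bytes:
    """Speech-to-speech: one model maps audio directly to audio."""
    return speech_to_speech(audio_in)
```

The chained version also discards paralinguistic cues (laughter, sighs, tone) at the STT step, which is why a single audio-to-audio model can preserve nuance the pipeline cannot.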

OpenAI also claims the model can deliver more natural, high-quality audio while following instructions across complex, multi-turn conversations.

Developers can also adjust pace, tone, style, and even roleplay characters.

Meanwhile, OpenAI claims the model can better handle unclear audio and long alphanumeric strings, like phone and license numbers. One study recently highlighted these strings as a big problem for rep-facing AI assistants leveraged in contact centers.

However, despite all the model’s advantages, there are cautions.

For instance, its cost is relatively high: $32 per 1M audio input tokens ($0.40 per 1M for cached input) and $64 per 1M audio output tokens.

As such, Alex Levin, CEO at Regal, estimated that the speech-to-speech model still costs approximately four times more than chaining a speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) pipeline for voice AI agents.
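A back-of-the-envelope calculation using the rates quoted above shows how those costs accumulate. The token counts in the example are illustrative assumptions, not OpenAI figures:

```python
# Rates quoted above, in dollars per 1M audio tokens.
AUDIO_IN_PER_1M = 32.00
AUDIO_OUT_PER_1M = 64.00

def s2s_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a call through the speech-to-speech model."""
    return (input_tokens * AUDIO_IN_PER_1M
            + output_tokens * AUDIO_OUT_PER_1M) / 1_000_000

# Hypothetical call consuming 60k input and 30k output audio tokens:
cost = s2s_cost(60_000, 30_000)
print(f"${cost:.2f}")  # → $3.84
```

At Levin’s estimate, an equivalent chained STT–LLM–TTS pipeline would run roughly a quarter of that figure.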

In a social post, the CEO also cautioned about the model’s limited controllability. He wrote:

The Realtime model is missing the control/observability that Voice AI Agent companies have in the “chained” model.

And it’s missing the ability to vary the model, voice, guardrails, etc, in each step of the conversation, which is currently easily achieved with a multi-state agent builder and a ‘chained’ model.

Despite these concerns, some enterprises are working with OpenAI to start testing the model, including T-Mobile…

T-Mobile Uses gpt-realtime for Customer Conversations

T-Mobile has been testing OpenAI’s models for six months and recently gained access to gpt-realtime. Using the model together with the Realtime API, the carrier claims to have already seen “huge improvements”.

In the announcement video, Julianne Roberson, Director of AI at T-Mobile, highlighted how T-Mobile is already experimenting with the model to reimagine the device upgrade process, one of its most common demand drivers.

During the demo, Roberson showed how the AI assistant guided a customer through selecting a phone under $300, checked compatibility with satellite services, and confirmed plan eligibility.

In doing so, she emphasized that the model feels far more human, able to follow customers through unpredictable conversations while recognizing emotions and handling multimodal inputs.

These multimodal capabilities will boost T-Mobile’s objective to provide “expert-level service everywhere” with AI.

Given T-Mobile’s close ties to OpenAI, it will be fascinating to see how this partnership develops, and whether the carrier shares OpenAI CEO Sam Altman’s prediction of the end of human customer service.
