Google Unveils Its Latest Voice Innovations

As Google’s voice portfolio expands, so do the customer experience opportunities


Published: October 24, 2022

Charlie Mitchell

Google has launched two new voice innovations: its Speech-to-Text API v2 and Speech On-Device.

The first is the next evolution of its Speech-to-Text API, an automatic speech recognition (ASR) system that allows developers to send audio to the cloud and get the transcription text directly back.

The second is a new offering that combines all of Google’s speech services, making them available locally on embedded devices.

Yet, to fully get to grips with these, first consider Google’s current voice portfolio.

Google’s Current Voice Portfolio

Since releasing its first speech patent in 2001, Google has led the way in voice innovation. From interacting with Google Assistant to live captioning in Google Meet, it now boasts an extensive suite of voice tools.

Within this are two core innovations: its Speech-to-Text and Text-to-Speech APIs.

The Speech-to-Text API supports short- and long-form speech in over 75 languages and 120+ locales out of the box, without the need for training or customization.

Of course, for some use cases, businesses may demand customization. As such, the API is flexible, allowing users to harness it across various audio channels. It also detects multiple speakers in the same channel, with the solution recognizing their unique voices.

Sharing more at the Google Next event, Calum Barnes, Head of Product for Cloud Speech at Google Cloud, said:

“Companies can train the system for specific use cases, so it understands specific company and product names, industry jargon etc. A Speech Adaptation API is also available to help this tuning process.”
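For readers who want to see what this tuning might look like in practice, here is a minimal sketch of a request body for the v1 REST method `speech:recognize`. It assumes phrase hints (`speechContexts`) are used for company and product names and that speaker diarization is switched on; the helper name `build_recognize_request` and its defaults are hypothetical, and the sketch only builds the JSON, leaving authentication and the HTTP call aside.

```python
import base64
import json


def build_recognize_request(audio_bytes, phrases, language_code="en-US",
                            min_speakers=2, max_speakers=2):
    """Build a JSON body for speech:recognize (hypothetical helper, no network call)."""
    config = {
        "languageCode": language_code,
        # Phrase hints bias recognition towards company names, product names, or jargon.
        "speechContexts": [{"phrases": list(phrases)}],
        # Ask the service to label which speaker said each word in a shared channel.
        "diarizationConfig": {
            "enableSpeakerDiarization": True,
            "minSpeakerCount": min_speakers,
            "maxSpeakerCount": max_speakers,
        },
    }
    # Inline audio is sent as base64 text in the "content" field.
    audio = {"content": base64.b64encode(audio_bytes).decode("ascii")}
    return json.dumps({"config": config, "audio": audio})


if __name__ == "__main__":
    # Placeholder bytes stand in for a real WAV or FLAC recording.
    print(build_recognize_request(b"placeholder-audio", ["Looker", "CCaaS"]))
```

The resulting string would be posted to the `speech:recognize` endpoint with valid credentials. Encoding and sample rate are left out here on the assumption that the audio is a WAV or FLAC file whose header carries them.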

Moreover, companies can create captions and subtitles for media content or build a virtual agent. It is also possible to use the technology for speech analysis, summarization, and extraction, each of which has significant potential for contact centers.

In tandem, many businesses harness Google’s Text-to-Speech API to communicate with their users. It allows them to take text and synthesize it into audio in a single step.

For this synthetic voice, Google offers 400 voices in 50+ languages and locales – all available within a single integration – allowing businesses to build their own voice applications.
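As an illustration of that single step, the sketch below builds a request body for the v1 REST method `text:synthesize` and decodes the base64 `audioContent` field that the service returns. The helper names are hypothetical, the language and encoding defaults are assumptions, and no network call or authentication is shown.

```python
import base64
import json


def build_synthesize_request(text, language_code="en-US", voice_name=None,
                             audio_encoding="MP3"):
    """Build a JSON body for text:synthesize (hypothetical helper, no network call)."""
    voice = {"languageCode": language_code}
    if voice_name:
        # Optional: pin a specific voice instead of letting the service choose one.
        voice["name"] = voice_name
    return json.dumps({
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {"audioEncoding": audio_encoding},
    })


def decode_audio(response_json):
    """Return raw audio bytes from a text:synthesize response body."""
    return base64.b64decode(json.loads(response_json)["audioContent"])


if __name__ == "__main__":
    print(build_synthesize_request("Thanks for calling. How can I help?"))
```

Writing the bytes returned by `decode_audio` to a file would yield a playable clip in the requested encoding.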

Innovation 1: Speech-to-Text API v2

The next generation of Google’s Speech-to-Text API is now available in public preview.

The API centers on “conformer models” that Google released in April. These “deep learning” end-to-end models take audio and convert it directly into text.

Because they avoid the intermediary steps that other ASR techniques require, these models are easier to optimize and can deliver higher-quality transcripts.

As a result, Google can offer better accuracy across different accents and acoustic environments, including various microphones and background noises.

With these models having proved popular, Google wanted to push them further and so launched Speech-to-Text API v2.

Announcing its release, Barnes added:

“Speech-to-Text API v2 will follow a modern, resourceful API design. This will allow us to bring many new Google Cloud-wide features – like single region data residency and Cloud Audit Logging – to businesses as part of the Speech-to-Text API.”

The solution is also ready for future conformers, “ultra-large speech models”, and whatever else may come down the pipeline from Google Research.

Finally, despite the announcement, the Speech-to-Text API v1 is not going away. As such, Google is not pressuring businesses to migrate or alter their code to harness v2.

Innovation 2: Speech On-Device

Speech On-Device is a new offering – which is now generally available – that makes Google’s speech services available locally.

As a result, businesses may harness Google’s ASR and speech synthesis models through an embedded device, with no connectivity required.

Processing speech locally enables low-latency experiences and preserves users’ privacy, as they can rest assured that their voice and speech data never leaves the local device.

Meanwhile, Barnes assured all adopters that the solution does not compromise on quality. He stated:

“These are not terse speech models that only work for specific situations, commands, or words. These are the same great models that you’ve used and are used to on the Cloud-side, available to run locally, with just a fraction of the computing power necessary.”

The offering runs on iOS and Android devices, and users of Google’s Speech APIs can contact their sellers to get started.

Google’s Vision for Voice

“Our goal at Google Cloud is to take all of this technology, and all the expertise around speech and voice built up at Google over the years, and make them available for enterprises to use in their own applications and to power their own voice experiences,” said Barnes.

Such a vision is clear from the latest offerings, which will support the next generation of voice interfaces.

Also, its evolved Speech-to-Text API offering may add weight to its new CCaaS platform, paving the way for advanced virtual agents and speech analytics.

Combining such analytics with its BI solution, a strategy that seems likely given the latest Looker enhancements, will help drive insights across the modern enterprise.

This example highlights how Google is strengthening many elements within the CX stack, increasing its interoperability, and supporting businesses in new, inventive ways.
