Speech-to-text technology (also known as automatic speech recognition) is not a new concept in the contact center landscape. In the 1950s and 60s, innovators were already building speech recognition systems that could recognize spoken digits. These solutions helped to inspire the creation of countless future CX tools.
Now, speech-to-text models are used in everything from IVR workflows to payment processing, conversational analytics, and self-service systems. As voice data continues to be one of the most important resources in a contact center, speech-to-text technology has grown increasingly advanced, leveraging new forms of natural language processing, natural language understanding, and machine learning.
The question is, how accurate are speech-to-text solutions in 2023?
The Importance of Accuracy in Speech-to-Text
Phone conversations are still the main way businesses and consumers interact. However, manual conversation analysis requires significant time and effort. Speech analysis software, leveraging automatic speech recognition (ASR) and speech-to-text technology, can alleviate these problems.
Speech-to-text technology can be used for everything from the creation of powerful AI chatbots, to discovering business insights and intelligence. However, the value of any speech-to-text offering hinges entirely on its accuracy. Over the years, voice assistants and AI tools have been able to achieve higher levels of accuracy, thanks to improved algorithms and data sources.
“The Speech AI and generative AI tools built on top of audio and video data today are only as useful as the accuracy of the conversational data being fed into them. If the transcript isn’t accurate, the insights and suggestions the AI models output won’t be accurate either. Highly accurate speech-to-text is critical for building a strong foundation,” says Prachie Banthia, VP of Product at AssemblyAI.
However, despite the sophisticated nature of AI solutions for processing human language today, many speech-to-text systems still suffer from various accuracy issues. Benchmarks published in 2021 found that Amazon's speech-to-text technology still had an error rate of 18.42%, Microsoft's error rate was 16.51%, and Google's video model came in at 15.82%.
Ultimately, no speech-to-text solution is 100% accurate. All of these systems encounter limitations, whether it's struggling to understand accents or dialects, or being unable to distinguish quiet speech from large amounts of background noise.
Notably, speech-to-text AI models can also only recognize words in their existing database. This means it’s possible for some models to overlook certain words and phrases entirely.
Measuring Speech-to-Text Accuracy: WER
Since the value of any speech-to-text solution depends on its accuracy, companies creating intelligent systems need to regularly test and evaluate their systems for potential discrepancies. The most common way of measuring the accuracy of a speech-to-text solution is Word Error Rate, or WER.
Word Error Rate calculates the exact number of errors present in a transcription produced by an ASR system, compared to a human transcription. It's the de facto standard for measuring how effective an automatic speech recognition tool, or speech-to-text solution, is.
However, WER may not be the only metric worth examining when analyzing the utility of a speech recognition system. Word Error Rate can tell companies and developers how different an automatic transcription is from a human transcription. However, WER isn't smart. It can't account for context or legibility; it only counts the substitutions, deletions, and insertions needed to turn one transcript into the other.
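To make the definition concrete, here is a minimal sketch of how WER is typically computed: the word-level edit distance (substitutions + deletions + insertions) between the ASR output and the human reference, divided by the number of words in the reference. This is an illustrative implementation, not any vendor's scoring code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()

    # Levenshtein (edit) distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("box" for "fox") across a four-word reference: WER = 0.25
print(wer("the quick brown fox", "the quick brown box"))
```

Note that because insertions count as errors, WER can exceed 100% when a model hallucinates extra words, which is one reason the metric is usually read alongside the transcript itself.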
A more comprehensive review of speech recognition systems may look at several other metrics, such as proper noun evaluation, using the right data set and normalizing speech. Proper noun evaluation helps evaluators better understand the model’s ability to recognize names and proper nouns. Alternatively, choosing the right evaluation data sets ensures businesses can be confident in the results of each evaluation. Plus, normalization strategies lead to more standardization in the comparison of AI-generated and human transcripts.
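Normalization, mentioned above, simply means bringing both transcripts to a common form before scoring, so that cosmetic differences in casing or punctuation don't register as word errors. A minimal sketch (the exact rules vary between evaluation toolkits) might look like this:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    formatting differences don't count as errors when comparing an
    AI-generated transcript against a human one."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

# "Hello, World!" and "hello world" now compare as identical.
print(normalize("Hello, World!  It's 10 AM."))
```

Production evaluation pipelines often go further, for example expanding numerals and abbreviations to spoken form, but the principle is the same: score the words, not the formatting.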
How Accurate Can Speech Recognition Be?
Although speech recognition software has been around for a number of decades, there’s still a lot of work to be done before a computer or algorithm can provide a fully accurate transcription of a conversation. New AI initiatives and algorithms are appearing to assist with this process, but there are vast differences between the accuracy rates of different tools.
Many companies still need to train the AI engines within a speech-to-text solution on proprietary data to ensure automatic transcriptions are as accurate as possible. Additionally, most brands need to constantly conduct tests to examine not just word error rate, but the legibility, contextual quality, and latency of transcriptions.
Fortunately, there are some innovators in the AI landscape currently developing more advanced and accurate speech-to-text and speech recognition solutions. For instance, AssemblyAI, a leading AI partner, built its Conformer-1 model to enhance the accuracy and value of the Conformer speech recognition neural net produced by Google Brain in 2020.
These changes led to 29% faster inference times and 36% faster training times. Conformer-1 also features a modified version of Sparse Attention, a pruning method that helps boost the model's accuracy when working with noisy audio. The result is a solution that is more robust and effective with real-world audio, making 43% fewer errors on noisy data.
The Evolution of Speech-to-Text Accuracy
Developments in AI algorithms and technology are powering a new future for speech recognition and speech-to-text technologies. This is crucial at a time when companies are investing more time and effort in the development of conversational bots, self-service solutions, and analytical tools.
As the landscape continues to expand, innovative AI companies are working on producing ever-more advanced tools for various use cases. For instance, AssemblyAI introduced the Conformer-2 service, which builds on the Conformer-1 with higher speed and performance metrics.
Conformer-2 retains parity with Conformer-1 in terms of word error rate; however, it goes further on other user-oriented metrics. For instance, Conformer-2 achieves a 31.7% improvement on alphanumerics, a 12% improvement in robustness to noise, and a 6.8% improvement in proper noun error rate.
These improvements were made possible by the increased availability of training data to the AI development team, and the larger number of models used for pseudo data labelling.
“Thanks to significant advances in AI research, speech-to-text models are more accurate today than ever before. The best models are trained on enormous amounts of data and achieve near-human level accuracy on a wide variety of datasets. This makes them a profoundly useful tool for contact center platforms,” says Dylan Fox, Founder and CEO at AssemblyAI.
As the contact center and technology spaces continue to transform, conversational AI solutions and speech-to-text tools will become increasingly accurate, reliable, and efficient.