Speech-to-text technology uses artificial intelligence (AI) to automatically transcribe raw speech into structured written text. Speech-to-text works with both live audio and recorded files, aiming to give you accurate transcripts that you can use for sentiment analysis, quality assurance, and agent training. But there is a catch – how reliable is speech-to-text, really?
The State of Speech-to-Text Accuracy in 2021
First, let’s understand how speech-to-text accuracy is measured.
The reliability of speech-to-text hinges on its accuracy rate – i.e., how many errors a transcript contains on average. This is measured as Word Error Rate (WER): the number of word-level errors (substituted, deleted, and inserted words) divided by the number of words actually spoken, expressed as a percentage. Accuracy is simply the complement of WER; if a transcript has a WER of 2%, then it is 98% accurate. Either way, knowing a speech-to-text engine’s WER is essential to understanding how reliable it actually is.
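To make the measurement concrete, here is a minimal Python sketch of a WER calculation. It assumes the common definition of WER as word-level edit distance (substitutions, deletions, and insertions) divided by the length of the reference transcript. The function name and the lowercase, whitespace-based tokenization are illustrative assumptions, not how any particular vendor or benchmark scores its engine.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


spoken = "thank you for calling how can i help you today"
transcribed = "thank you for calling how can i help you to day"
wer = word_error_rate(spoken, transcribed)
accuracy = 1 - wer
```

In this example the engine splits "today" into "to day", which counts as one substitution plus one insertion against a ten-word reference, so the WER works out to 20% and the accuracy to 80%.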
Surprisingly, despite the sophisticated nature of AI today, the average accuracy of speech-to-text is far from 100%. As per benchmarks published in March 2020, Amazon had an accuracy of 73% (i.e., 27% WER), Microsoft was 78% accurate, Google came in at 79%, and Rev.ai (a dedicated speech-to-text engine provider) scored a slightly better 84%.
This means that in 1,000 words of transcribed text, you would have at least 160 incorrectly transcribed words even with the best-scoring engine in the above benchmarks, and as many as 270 with the lowest-scoring one.
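The arithmetic behind that estimate can be sketched in a few lines of Python. The accuracy figures are the March 2020 benchmark numbers quoted above; the function and variable names are illustrative only.

```python
def expected_errors(word_count: int, accuracy_percent: float) -> int:
    """Estimate the number of misrecognized words in a transcript."""
    # WER is the complement of accuracy, so errors = words * (100 - accuracy) / 100.
    return round(word_count * (100 - accuracy_percent) / 100)


# Accuracy figures from the March 2020 benchmarks cited in the text.
benchmark_accuracy = {"Amazon": 73, "Microsoft": 78, "Google": 79, "Rev.ai": 84}

errors_per_1000_words = {
    engine: expected_errors(1000, accuracy)
    for engine, accuracy in benchmark_accuracy.items()
}
```

For a 1,000-word transcript this gives 270 expected errors for Amazon, 220 for Microsoft, 210 for Google, and 160 for Rev.ai.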
Of course, these numbers depend on the testing conditions and the complexity of the tasks thrown at the engines. For example, a May benchmark found Microsoft to be 81.01% accurate, AWS to be 83.12% accurate, and Google slightly higher at 84.46%.
Like Rev, dedicated provider Temi also promises better reliability at 13.9% WER or 86.1% accuracy.
In other words, you should expect anywhere between 15-25 errors for every 100 words transcribed across the leading speech-to-text engines available today.
How to Make Speech-to-Text More Reliable?
There are two ways you can improve accuracy rates for automatic transcriptions – training the AI engine and reducing interference.
You can train the AI to accurately interpret the specific accent, inflexion, and voice modulation commonly used by your agents by feeding the engine pre-recorded audio files. You could also work on reducing interference in the calling vicinity, by using superior quality microphones, keeping ambient noise levels to a minimum, and eliminating sudden interruptions.
Finally, you could specifically train the AI to accurately transcribe industry-specific terminology that your agents use regularly, but that may not be commonplace enough for the engine to pick up correctly the first time around.
Apart from Accuracy, is There Anything Else to Consider?
While WER is undeniably the most relevant indicator of speech-to-text reliability, there is another factor involved – latency.
Transcription latency is the number of seconds the speech-to-text engine takes to convert raw audio into a workable transcript. It may be worth opting for a solution with higher latency if the extra processing time keeps WER to a minimum.
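If you want to compare engines on latency yourself, one simple approach is to time each transcription call. The Python sketch below assumes the engine is wrapped in a function that takes audio bytes and returns text. `fake_engine` is a hypothetical stand-in used only to make the example runnable, not a real vendor API.

```python
import time
from typing import Callable, Tuple


def timed_transcription(
    transcribe: Callable[[bytes], str], audio: bytes
) -> Tuple[str, float]:
    """Run a transcription call and return (transcript, latency in seconds)."""
    start = time.perf_counter()
    transcript = transcribe(audio)
    latency_seconds = time.perf_counter() - start
    return transcript, latency_seconds


def fake_engine(audio: bytes) -> str:
    # Hypothetical stand-in for a real speech-to-text client call.
    time.sleep(0.05)  # simulate processing time
    return "hello world"


transcript, latency_seconds = timed_transcription(fake_engine, b"\x00\x01")
```

Recording the latency next to the WER for the same audio sample lets you weigh the two against each other when choosing a solution.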