Amazon has launched an exciting new feature for Alexa – Live Translation. The feature allows users speaking in two different languages to converse with each other, with Alexa acting as an interpreter and translating both sides of the conversation.
With this add-on, a customer can ask Alexa to initiate a translation session for a pair of languages. Once the session has started, users can speak phrases or sentences in either language, and Alexa will automatically identify which language is being spoken and translate each side of the conversation.
Currently, the feature works with six language pairs (English paired with Spanish, French, German, Italian, Brazilian Portuguese, or Hindi) on Echo devices with the locale set to US English.
The feature leverages several existing Amazon systems. These include Alexa’s automatic-speech-recognition (ASR) system, Amazon Translate, and Alexa’s text-to-speech system, with the overall architecture and machine learning models designed and further optimised for conversational-speech translation.
Language ID
Alexa runs two ASR models in parallel during a translation session, along with a separate model for language identification. Input speech passes to both ASR models at once. Based on the language ID model’s classification result, only one ASR model’s output is sent to the translation engine.
This parallel implementation is important for keeping the latency of translation requests low: waiting to begin speech recognition until the language ID model has returned a result would delay playback of the translated audio, Amazon explained.
The company also learned that the language ID model is most effective when it bases its decision on both acoustic information about the speech signal and the outputs of both ASR models.
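As a rough illustration of that flow, the sketch below runs two hypothetical ASR models in parallel and lets a language ID classifier, which sees both the audio and the two transcripts, decide which output moves on to translation. The model objects and their methods are placeholders for illustration, not Amazon's actual APIs.

```python
import concurrent.futures

def transcribe_in_parallel(audio, asr_models, lang_id_model):
    """Run one ASR model per language on the same audio, then keep only
    the transcript for the language the ID model selects."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Both ASR models start decoding immediately, so no time is lost
        # waiting for the language ID decision.
        futures = {lang: pool.submit(model.transcribe, audio)
                   for lang, model in asr_models.items()}
        transcripts = {lang: f.result() for lang, f in futures.items()}

    # The classifier sees both the raw audio (acoustic cues) and the two
    # ASR outputs before picking a language, as described above.
    detected = lang_id_model.classify(audio, transcripts)
    return detected, transcripts[detected]
```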
Once the language ID system has selected a language, the associated ASR output is processed and forwarded to Amazon Translate. The result is subsequently passed to Alexa’s text-to-speech system and played back to users.
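The downstream half of the pipeline can be approximated with public AWS services. The sketch below calls Amazon Translate, which the article names, and uses Amazon Polly as a stand-in for Alexa's internal text-to-speech system; the function, language codes, and voice choice are illustrative assumptions.

```python
import boto3

# Amazon Translate is part of the described pipeline; Polly stands in here
# for Alexa's own TTS, which is not publicly exposed.
translate = boto3.client("translate")
polly = boto3.client("polly")

def translate_and_speak(transcript, source_lang, target_lang, voice_id):
    # Translate the selected ASR transcript into the other language.
    result = translate.translate_text(
        Text=transcript,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )
    # Synthesize the translation so it can be played back to the listener.
    speech = polly.synthesize_speech(
        Text=result["TranslatedText"],
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return speech["AudioStream"].read()

# Example: an English utterance rendered as Spanish audio.
audio_bytes = translate_and_speak("Where is the train station?", "en", "es", "Lupe")
```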
Speech Recognition
Like most ASR systems, the ones Amazon uses for Live Translation combine an acoustic model and a language model. The acoustic model converts audio into the smallest units of speech, called phonemes, while the language model encodes the probabilities of particular strings of words. This helps the ASR system decide between alternative interpretations of the same sequence of phonemes.
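A toy example of that division of labour: given two word sequences that could match roughly the same phonemes, combining acoustic scores with language-model scores lets the system prefer the more plausible sentence. The hypotheses and log-probabilities below are made up for illustration.

```python
# Two word sequences that could match similar phonemes, with invented
# acoustic and language-model log-probabilities.
hypotheses = {
    "recognize speech":   {"acoustic": -4.1, "lm": -2.0},
    "wreck a nice beach": {"acoustic": -4.0, "lm": -9.5},
}

def pick_hypothesis(scored):
    """Return the word sequence with the best combined log-probability."""
    return max(scored, key=lambda h: scored[h]["acoustic"] + scored[h]["lm"])

print(pick_hypothesis(hypotheses))  # -> "recognize speech"
```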
To adapt the acoustic models, Amazon used connectionist temporal classification (CTC), followed by multiple passes of state-level minimum-Bayes-risk (sMBR) training. Amazon also mixed noise into the training set, enabling the model to focus on characteristics of the input signal that vary less under different acoustic conditions. This helped make the acoustic model more robust.
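Mixing noise into training audio is a standard data-augmentation step. The sketch below shows one common way to do it, scaling a noise clip to a target signal-to-noise ratio before adding it to the speech; Amazon does not describe its exact procedure, so treat this as a generic illustration.

```python
import numpy as np

def mix_noise(speech, noise, snr_db):
    """Mix a noise clip into a speech clip at a target signal-to-noise
    ratio (in dB)."""
    noise = np.resize(noise, speech.shape)        # loop or trim noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that speech_power / (scaled noise power) hits the target SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example with random signals standing in for real audio waveforms.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # one second of "speech" at 16 kHz
babble = rng.standard_normal(8000)   # a shorter "noise" clip
noisy = mix_noise(clean, babble, snr_db=10)
```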
The Detail
Adapting to conversational speech also required modifying Alexa's end-pointer, which determines when a customer has finished speaking. The end-pointer already distinguishes between pauses at the ends of sentences, which indicate that the user has finished speaking and that Alexa should respond, and mid-sentence pauses, which suggest the user may go on speaking and need to be tolerated for longer. For Live Translation, Amazon modified the end-pointer to tolerate longer pauses at the ends of sentences as well, since speakers engaged in long conversations will often take time between sentences to formulate their thoughts.
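A minimal sketch of that end-pointing idea follows. The thresholds and the punctuation-free sentence-boundary signal are illustrative assumptions, not Alexa's actual logic.

```python
# Pause tolerances in milliseconds (illustrative values only).
MID_SENTENCE_PAUSE_MS = 1500                 # mid-sentence pauses are tolerated longer
END_OF_SENTENCE_PAUSE_MS = 500               # default: respond quickly once a sentence ends
TRANSLATION_END_OF_SENTENCE_PAUSE_MS = 1200  # relaxed for live-translation sessions

def should_end_turn(silence_ms, at_sentence_boundary, translation_mode=False):
    """Return True when the speaker is judged to have finished their turn."""
    if at_sentence_boundary:
        threshold = (TRANSLATION_END_OF_SENTENCE_PAUSE_MS
                     if translation_mode else END_OF_SENTENCE_PAUSE_MS)
    else:
        threshold = MID_SENTENCE_PAUSE_MS
    return silence_ms >= threshold

# The same 800 ms pause ends the turn with the default end-pointer
# but is tolerated during a translation session.
print(should_end_turn(800, at_sentence_boundary=True))                         # True
print(should_end_turn(800, at_sentence_boundary=True, translation_mode=True))  # False
```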