DefinedCrowd Releases Accented Speech Datasets

The first dataset contains accented English data from over 15 countries


Published: July 30, 2021

Sandra Radlovački

DefinedCrowd, the one-stop shop for high-quality artificial intelligence training data, has released the first in a series of free Spanish-accented English speech datasets. The datasets will allow AI developers to test how well their speech recognition models understand non-native English speakers, a demographic of over 35 million people in the U.S.

Dr. Daniela Braga, founder and CEO of DefinedCrowd, said: “There is an accent gap in speech technology. Research shows that speech recognition technologies are not nearly as accurate in understanding nonnative accents as they are in understanding white, non-immigrant, upper-middle-class Americans.”

Speech recognition technology is often prone to bias. To combat this, DefinedCrowd has released the first of four sets of Spanish-accented English speech data, which developers can use to test and benchmark their models, identify bias, and find areas that need more training data.

“Unfortunately, it has resulted in models that are more useful to some people than to others. And that must change,” said Dr. Braga.

Many companies do not have the resources to train or test their systems with different accents, which makes speech recognition systems less responsive and less accurate, and makes using them an isolating experience for non-native English speakers.

The first dataset, released in two phases, contains Spanish-accented English data from across the Americas, including Argentina, Brazil, Canada, Chile, Colombia, the Dominican Republic, Guatemala, Honduras, Mexico, Nicaragua, Panama, Peru, the United States, Uruguay, and Venezuela.

Christopher Shulby, Director of Machine Learning Engineering at DefinedCrowd, said:

“For companies with AI solutions to compete in the large nonnative English-speaking market in the U.S., speech models need to be able to understand a wide range of different Spanish accents, originating from all the Americas.”

Subsequent releases will include datasets from native Spanish speakers from around the world, including Australia, China, Finland, France, Germany, India, Israel, Italy, Norway, Portugal, Russia, Spain, Sweden, and the United Kingdom.