Using Accented English Data to Improve Your AI Models

From: Nexdata Date: 2024-08-15

➤ Google Assistant Performs Better on Chinese-Accented English

With the rapid development of artificial intelligence technology, high-quality data sets have become an important factor in promoting model accuracy and reliability. In many fields such as autonomous driving, smart security, and medical diagnosis, the role of data sets is irreplaceable. However, different application scenarios require different types and amounts of data. How to efficiently collect and use data sets is an important prerequisite for promoting the development of artificial intelligence technology.

According to reports, Vocalize.ai’s laboratory tested the speech recognition capability of Amazon’s voice assistant Alexa, Apple’s voice assistant Siri, and Google’s voice assistant Google Assistant. Researchers measured how well the three voice assistants understand accented English from speakers in three countries: the United States, India, and China.

The results show that Google Assistant clearly outperforms the other two voice assistants in understanding Chinese-accented English. The main reason for this result is that Google Assistant has learned from a considerable amount of Chinese-accented English data, while the other two voice assistants have not.

➤ Nexdata's English Datasets

English is an international language, and its accents differ greatly across countries and regions. In some regions, the accent is so strong that it can sound like another language. If a machine does not learn from large amounts of accented English data from different regions, it is very unlikely to recognize these accents.

Currently, ASR systems achieve high accuracy on standard English accents and can meet the commercial requirements of certain scenarios, but the shortage of accented English speech data has severely restricted research on accented English recognition.

As the world’s leading AI data service provider, Nexdata offers accented English data from dozens of countries and regions, including the United Kingdom, the United States, China, Germany, and India. The data covers a variety of pronunciation styles and accents and comes with phonetic transcription, accent annotation, and prosody annotation, which can power research on accented English.
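One common way to quantify the accuracy gap between standard and accented English is to compute the word error rate (WER) separately for each accent group. The snippet below is a minimal sketch of that idea, not the method used in the test reported above. The function names and the sample transcripts are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_accent(samples):
    """samples: iterable of (accent, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for accent, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[accent] += e
        words[accent] += n
    # Pool errors and words per accent before dividing.
    return {a: errors[a] / words[a] for a in words if words[a] > 0}


# Made-up transcripts purely for illustration; these are not real test results.
samples = [
    ("US", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("CN", "turn on the kitchen lights", "turn on the chicken light"),
    ("IN", "set a timer for ten minutes", "set a timer for ten minutes"),
    ("CN", "set a timer for ten minutes", "set timer for ten minutes"),
]
print(wer_by_accent(samples))
```

Pooling errors and reference words per accent before dividing keeps long utterances from being under-weighted, which a simple average of per-utterance WER values would do.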

Chinese English Datasets

More than 3,000 native Chinese speakers recorded 100,000 common English sentences. The speakers come from regions including Jiangsu, Shandong, Beijing, and Henan, and the recordings reflect the characteristic accent of Chinese speakers of English. The recording text consists of commonly used English sentences that are rich in content, wide in scope, and phonemically balanced.

American English Datasets

Nearly 2,000 native American English speakers participated in the recording, with authentic accents. The recording text was designed by language experts around interactive scenarios and covers multiple categories, such as human-machine interaction, in-car, home, and general.

British English Datasets

The data was recorded by 1,651 native British English speakers with authentic accents. The recording text covers multiple categories, such as human-machine interaction, in-car, home, and general.

German English Datasets

More than 1,000 native German speakers of English participated in the recording, with authentic accents. The recording text was designed by language experts and covers multiple categories, such as human-machine interaction, in-car, home, and general.

French English Datasets

More than 1,000 native French speakers of English participated in the recording, with authentic accents. The recording text covers multiple categories, such as human-machine interaction, in-car, home, and general. The data was recorded in a quiet room by speakers aged 18 to 60.

Indian English Datasets

The data was recorded by more than 2,000 native Indian English speakers with authentic accents. The recording text covers multiple categories, such as human-machine interaction, in-car, home, and general. The sentence accuracy is over 95%.

Nexdata is always committed to protecting participants’ interests, protecting data security, and respecting participants’ privacy. Nexdata has passed the ISO 27701 privacy information management system, ISO 27001 information security management system, and ISO 9001 quality management system certifications.

If the above datasets cannot meet your current research needs, Nexdata can also provide data customization services for specific populations, scenarios, and languages to help customers get the data they need.

End

If you need data services, please feel free to contact us: info@nexdata.ai

Future intelligent systems will increasingly rely on high-quality datasets to optimize decision-making and automated processes. In the era of data, companies and researchers need to continuously improve their data collection and annotation capabilities to ensure the efficiency and accuracy of AI models. To gain an advantageous position in a fiercely competitive market, we must lay a solid foundation in data.