How AI is Breaking the Language Barrier

From: Nexdata Date: 2024-08-15

➤ Meta AI's UST project and AI translation

In modern artificial intelligence, an algorithm's success depends heavily on the quality of its data. As data grows more important to AI models, collecting and making full use of high-quality data becomes crucial. This article will help you better understand the core role of data in AI programs.

Recently, Meta AI announced the launch of the Universal Speech Translator (UST) project, which aims to create an AI system that can perform real-time speech translation across all languages, even those that are commonly spoken but not commonly written.

According to Meta, the model is the first AI-powered system for real-time speech translation between Hokkien, a primarily oral language, and English. The project is working on more real-time speech-to-speech translation so that people in the metaverse can interact more easily.

With the development of neural networks and deep learning, AI translation has improved rapidly, and its output can in some cases be hard to distinguish from human translation. AI helps lower the language barrier, making communication more convenient and engaging. According to a forecast by Global Market Insights, a US research company, the translation market will grow 3.6 times by 2027, reaching US$3 billion.

➤ Speech data by mobile phone

Data is the foundation of AI and machine learning. Today's commercial ASR models are mostly trained on English datasets and thus have higher accuracy for English speech interactions. By contrast, little training data is currently available for low-resource languages, and what exists tends to cover a single scenario and pose little challenge, so it cannot show how well a model generalizes to large data volumes and complex scenes.

To allow people who speak minority languages to enjoy the convenience of artificial intelligence, Nexdata has developed more than 100,000 hours of read speech data, covering more than 60 languages and dialects around the world and multiple application scenarios.

211 Hours — German Speech Data by Mobile Phone_Reading

The dataset contains speech data from 327 native German speakers. The recorded content covers economics, entertainment, news, spoken language, numbers, letters, etc. Each sentence contains 10.3 words on average and is repeated 1.4 times on average. All texts are manually transcribed to ensure high accuracy.
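Figures such as the average sentence length and the average number of repetitions can be recomputed from a list of transcripts. The sketch below is only an illustration: the function name `corpus_stats` and the sample sentences are hypothetical, and it assumes that "repetitions" means the number of recordings divided by the number of distinct sentences, which the dataset description does not spell out.

```python
from collections import Counter


def corpus_stats(utterances):
    """Return (average words per utterance, average recordings per distinct sentence)."""
    if not utterances:
        raise ValueError("utterances must not be empty")

    # Sentence length measured in whitespace-separated words.
    word_counts = [len(text.split()) for text in utterances]
    avg_words = sum(word_counts) / len(word_counts)

    # Assumed definition: total recordings / number of distinct sentences.
    distinct = Counter(text.strip().lower() for text in utterances)
    avg_repeats = len(utterances) / len(distinct)
    return avg_words, avg_repeats


# Hypothetical sample transcripts, not taken from the dataset.
sample = [
    "guten morgen",
    "wie spät ist es",
    "guten morgen",
]
avg_words, avg_repeats = corpus_stats(sample)
print(f"{avg_words:.2f} words per sentence, {avg_repeats:.2f} repetitions per sentence")
```

Whitespace splitting is a reasonable approximation for German; languages written without spaces between words would need a proper tokenizer.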

1,441 Hours — Italian Speech Data by Mobile Phone

The data were recorded by 3,109 native Italian speakers with authentic Italian accents. The recorded content covers a wide range of categories such as general purpose, interactive, in-car commands, home commands, etc. The recorded text is designed by a language expert and manually proofread with high accuracy. It matches mainstream Android and Apple phones.

986 Hours — European Portuguese Speech Data by Mobile Phone

The dataset contains speech data from 2,000 native Portuguese speakers with authentic accents. The recorded text is designed by professional language experts and is rich in content, covering multiple categories such as general purpose, interactive, in-vehicle and household commands. The recording environment is quiet and free of echo. The texts are manually transcribed with a high accuracy rate. Recording devices are mainstream Android phones and iPhones.

762 Hours — Spanish (Latin America) Speech Data by Mobile Phone

1,630 native Spanish speakers of non-Spanish nationality, such as Mexicans and Colombians, participated in the recording with authentic accents. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, in-vehicle and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple phones.

1,002 Hours — Russian Speech Data by Mobile Phone

1,960 native Russian speakers participated in the recording with authentic accents. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, in-vehicle and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple phones.

516 Hours — Korean Speech Data by Mobile Phone

This dataset contains 516 hours of natural Korean conversation collected by phone from more than 1,077 native speakers, with around half an hour of speech per speaker. It was developed with a proper balance of gender ratio and geographical distribution. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed with the text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
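A stated audio format like this can be checked against the file headers before training. The following is a minimal sketch using Python's standard `wave` module; the helper names `check_wav_format` and `write_silence` and the temporary test files are hypothetical and only serve the demonstration. The number of channels is not checked because the description above does not specify it.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 16_000   # 16 kHz
EXPECTED_SAMPLE_WIDTH = 2   # 16 bit = 2 bytes per sample


def check_wav_format(path):
    """Return a list of mismatches between the WAV header and the expected format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bit")
        if wav.getcomptype() != "NONE":
            problems.append(f"compression type is {wav.getcomptype()}")
    return problems


def write_silence(path, rate_hz, sample_width, seconds=1):
    """Write a mono WAV file of silence, used here only to exercise the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (rate_hz * sample_width * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16_000, 2)   # matches the stated format
    write_silence(bad, 8_000, 1)     # 8 kHz, 8 bit: should be flagged
    good_problems = check_wav_format(good)
    bad_problems = check_wav_format(bad)

print(good_problems)
print(bad_problems)
```

An empty list means the header matches the stated format. The check only reads header fields, so it says nothing about recording quality or background noise.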

➤ Speech data of different languages

759 Hours — Hindi Speech Data by Mobile Phone

The data is 759 hours long and was recorded by 1,425 native speakers from India with authentic accents. The recording text is designed by language experts and covers general, interactive, car, home and other categories. The text is manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones. It can be applied to speech recognition, machine translation, and voiceprint recognition.

760 Hours — Vietnamese Speech Data by Mobile Phone

1,751 native Vietnamese speakers participated in the recording with authentic accents. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, in-vehicle and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple phones.

522 Hours — Filipino Speech Data by Mobile Phone

The data were recorded by Filipino speakers with authentic Filipino accents. The text is manually proofread with high accuracy. It matches mainstream Android and Apple phones.

1,652 Hours — Cantonese Dialect Speech Data by Mobile Phone

The dataset covers 4,888 speakers from Guangdong Province, recorded in quiet indoor environments. The recorded content covers 500,000 commonly used spoken sentences, including high-frequency words in weico and everyday expressions. The average number of repetitions is 1.5 and the average sentence length is 12.5 words. Recording devices are mainstream Android phones and iPhones.


If you want to know more about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai.

In the development of artificial intelligence, there is no substitute for datasets. For AI models to better understand and predict human behavior, ensuring the integrity and diversity of data must be a primary mission. By promoting data sharing and data standardization, companies and research institutions together will accelerate the maturity and adoption of AI technologies.
