Nexdata Uncommon Language Speech Recognition Dataset

From: Nexdata Date: 2024-08-15

➤ Nexdata's Uncommon Language Datasets

With machine learning technology now widespread, the importance of data has become clear. Datasets not only provide the foundation for the architecture of AI systems but also determine the breadth and depth of their applications. From anti-spoofing to facial recognition to autonomous driving, the collection and processing of perception data have become prerequisites for achieving technological breakthroughs. Hence, high-quality data sources are becoming an important asset for market competitiveness.

A major problem with speech recognition datasets on the market is that they focus on English and other European languages. Because languages differ greatly from one another, AI vendors that want to support uncommon languages need to build separate models according to each language's characteristics. To ensure recognition quality, high-quality speech recognition datasets in each language are needed for model optimization. However, the scarcity of such datasets for uncommon languages has become a major bottleneck in speech recognition.

As the world's leading AI data service provider, Nexdata currently has pre-labeled speech recognition datasets in more than 30 uncommon languages, which can meet the needs of speech recognition in most uncommon languages. Nexdata strictly abides by the relevant regulations, and all the collected speech recognition datasets have been authorized by the speakers who were recorded.

➤ Speech recognition datasets by phone

760 Hours - Vietnamese Speech Recognition Dataset by Mobile Phone

1,751 native Vietnamese speakers participated in the recording, with authentic accents. The recording script was designed by linguists and covers a wide range of topics, including generic, interactive, in-car and home. The text is manually proofread with high accuracy. The data was recorded on mainstream Android phones and iPhones.

292 Hours - Thai Speech Recognition Dataset by Mobile Phone_Reading

The Thai Speech Recognition Dataset (reading) was collected from 498 native Thai speakers and recorded in a quiet environment. The recordings are rich in content, covering multiple categories such as economics, entertainment, news, figures, and spoken language. Each speaker recorded around 400 sentences. The valid data volume is 292 hours. All texts are manually transcribed with high accuracy.

759 Hours - Hindi Speech Recognition Dataset by Mobile Phone

The data is 759 hours long and was recorded by 1,425 native speakers from India with authentic accents. The recording text was designed by language experts and covers general, interactive, in-car, home and other categories. The text is manually proofread with high accuracy. The recording devices are mainstream Android phones and iPhones. The Hindi Speech Recognition Dataset can be applied to speech recognition, machine translation, and voiceprint recognition.

134 Hours - Malay Speech Recognition Dataset by Mobile Phone_Reading

This Malay reading dataset was recorded by 156 native Malay speakers in a quiet environment. The recordings are rich in content, covering multiple categories such as economy, entertainment, news, spoken language, numbers, and letters. Each speaker recorded around 450 sentences. The valid duration is 134 hours. All texts are manually transcribed to ensure high accuracy.

639 Hours - Indonesian Speech Recognition Dataset by Mobile Phone

➤ Indonesian Speech Recognition Dataset

1,285 native Indonesian speakers participated in the recording, with authentic accents. The recording script was designed by linguists and covers a wide range of topics, including generic, interactive, in-car and home. The text is manually proofread with high accuracy. The data was recorded on mainstream Android phones and iPhones. The Indonesian Speech Recognition Dataset can be applied to automatic speech recognition and machine translation scenarios.

End

If you want to know more about the speech recognition datasets or how to acquire them, please feel free to contact us: [email protected].

The future of AI depends heavily on data. As technology develops and application scenarios expand, high-quality datasets will become the key to improving AI performance. In this data-driven revolution, we will be better able to meet the opportunities and challenges of technological development if we keep our focus on data quality and strengthen data security management.
