Improving your AI Models with High-quality Chinese Dialects Data

From: Nexdata Date: 2024-08-15

➤ Nexdata's Chinese dialect data

In building intelligent systems, the quality of the training data matters more than the algorithm itself. To cope with the challenges of complex scenarios, researchers need to collect and annotate different types of data to improve the capabilities of AI systems. Today, every industry is exploring how to use data-driven technology to build smarter business processes and decision-making systems.

With the expansion of AI applications, dialect recognition has received increasing attention. However, because Chinese dialects differ greatly from Mandarin, speech recognition for Chinese dialects is much more complicated.

➤ Dialect Conversational Speech Data

Generally speaking, speech data collection means recording commonly used sentences and words as text, phonetic symbols, and audio, then integrating the recorded content into a database. However, China has so many dialects that the volume of data to collect is massive, and it is difficult to establish a national dialect database in a short time.

To support large-scale applications of Chinese dialects, Nexdata has planned ahead and accumulated 25,000 hours of Chinese dialect data, covering the dialect regions of Fujian, Guangdong, Wu, Hunan, the Southwest, the Northeast, and the Central Plains, as well as ethnic minorities. The datasets can be delivered in seconds and can quickly help improve the recognition accuracy of AI models. All the datasets are recorded by native speakers who have signed authorization agreements.

Cantonese Conversational Speech Data

Nearly 1,000 Cantonese speakers participated in the recording, talking face to face in a natural way. They discussed a number of given topics freely, covering a wide range of fields. The speech is natural and fluent, in line with real dialogue scenes.

Minnan Dialect Conversational Speech Data

This dataset was collected from nearly 1,000 speakers from Fujian Province. Dozens of topics are specified, and the speakers hold dialogues on those topics while the recording is performed. The sentence accuracy rate is 95%.

Sichuan Dialect Conversational Speech Data

A total of 1,730 native Sichuan speakers participated in the recording, talking freely face to face in a natural way across a wide range of fields, with no topic specified. The speech is natural and fluent, in line with real dialogue scenes.

➤ Nexdata's data customization service

If the above data cannot meet the needs of your current research, Nexdata also provides data customization services for specific groups of people, specific scenarios, and specific languages to meet customers’ diversified data needs.

End

If you need data services, please feel free to contact us: info@nexdata.ai

On the road to an intelligent future, data will always be an indispensable driving force. The continuous expansion and optimization of all kinds of datasets will provide a broader application space for AI algorithms. By constantly exploring new data collection and annotation methods, all industries can better handle complex application scenarios.
