Fueling AI Performance in Customer Service with Telephony Speech Data

From: Nexdata Date: 2024-08-14

➤ Chinese dialects and speech recognition

AI-based applications cannot be built without the support of massive amounts of data. Whether the task is conversational AI, autonomous driving or medical image analysis, the diversity and integrity of the training datasets largely determine how well AI models perform in testing. Today, data has become a crucial factor in advancing intelligent technology, and many fields are continually collecting and building more specialized datasets to make their applications more effective.

Chinese, with its rich linguistic heritage, is a language that boasts a multitude of dialects. From Mandarin to Cantonese, Shanghainese to Hokkien, these dialects reflect the diverse cultural and regional identities across China. However, this linguistic diversity poses a significant challenge when it comes to speech recognition technology.

Speech recognition is the process of converting spoken words into written text using advanced algorithms and machine learning. It has become increasingly prevalent in our daily lives, from virtual assistants like Siri and Alexa to voice-controlled devices. However, the complexity of Chinese dialects complicates the development and implementation of accurate speech recognition systems.

➤ Challenges in Chinese dialects recognition

One of the primary challenges lies in the vast differences in pronunciation and vocabulary among Chinese dialects. Mandarin, the official language of China, serves as a common standard, but even within Mandarin, there are variations across different regions. For example, the pronunciation of certain sounds may differ between northern and southern dialects. This variability makes it difficult for speech recognition systems to accurately interpret and transcribe spoken words, leading to errors and misinterpretations.

Furthermore, the lack of standardized written forms for some Chinese dialects adds another layer of complexity. While Mandarin has a unified system of characters, dialects like Cantonese are predominantly spoken languages with limited written representation. This lack of standardized characters makes it challenging for speech recognition systems to match spoken words with written equivalents accurately.

Another hurdle is the limited availability of training data for Chinese dialects. Speech recognition systems rely heavily on vast amounts of labeled data to learn and improve their accuracy. However, compared to Mandarin, there is significantly less data available for other Chinese dialects. This scarcity hinders the training of speech recognition models for these dialects, impeding their development and accuracy.

➤ Nexdata Chinese Dialects Data

500 Hours – Minnan Dialect Conversational Speech Data by Mobile Phone

The 500 Hours – Minnan Dialect Conversational Speech Data was collected by mobile phone from more than 1,000 native speakers, with a balanced gender ratio and geographical distribution. Speakers chose a few familiar topics from a given list and held conversations on them, which keeps the dialogue fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed, and the start and end timestamps of each effective sentence and the speaker identification, including gender, were also annotated. The sentence accuracy rate is ≥ 95%.
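The audio format stated above (16 kHz, 16-bit, uncompressed WAV) can be checked on delivery with a few lines of Python. The following is a minimal sketch using only the standard-library `wave` module. The function names `check_wav_format` and `make_silent_wav` are illustrative and not part of any Nexdata tooling, and the channel count is not checked because the description does not state it.

```python
import io
import wave

# Expected format taken from the dataset description:
# 16 kHz sample rate, 16-bit samples, uncompressed WAV.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits


def check_wav_format(source):
    """Return a list of mismatches between a WAV file and the expected format.

    `source` may be a file path or a binary file-like object.
    An empty list means the file matches.
    """
    try:
        wav = wave.open(source, "rb")
    except wave.Error as exc:
        # The wave module rejects files it cannot parse as PCM WAV,
        # including compressed WAV variants.
        return [f"not readable as PCM WAV: {exc}"]

    problems = []
    with wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits")
    return problems


def make_silent_wav(rate_hz, sample_width_bytes, n_frames=160):
    """Build a short mono WAV of zero bytes in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setframerate(rate_hz)
        wav.setsampwidth(sample_width_bytes)
        wav.writeframes(b"\x00" * sample_width_bytes * n_frames)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(16000, 2)))  # matching file
    print(check_wav_format(make_silent_wav(8000, 2)))   # wrong sample rate
```

In practice `check_wav_format` would be called with the path of each delivered recording. The in-memory helper is only there so the sketch runs without any dataset files.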

799 Hours - Sichuan Dialect Conversational Speech Data by Mobile Phone

The 799 Hours - Sichuan Dialect Conversational Speech Data by Mobile Phone was collected from 1,730 native speakers. Speakers held conversations with no restriction on topic, which keeps the dialogue fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed, and the start and end time of each effective sentence, the speaker identification and other attributes were annotated. The sentence accuracy rate is ≥ 95%.
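Both conversational datasets annotate each effective sentence with start and end times, a speaker identifier and a transcription, and both quote a sentence accuracy rate of at least 95%. The sketch below shows one way such annotations might be represented and spot-checked. The `Segment` field names are hypothetical rather than Nexdata's delivery schema, and the definition of sentence accuracy used here (the share of sentences whose transcription exactly matches an independently reviewed transcription) is an assumption, since the source does not define the metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated sentence. Field names are hypothetical, not a vendor schema."""
    start_s: float    # start time of the sentence, in seconds
    end_s: float      # end time of the sentence, in seconds
    speaker_id: str
    gender: str
    text: str         # manual transcription


def sentence_accuracy(delivered, reviewed):
    """Fraction of delivered sentences whose text exactly matches the reviewed text.

    Assumes both lists cover the same sentences in the same order.
    """
    if len(delivered) != len(reviewed):
        raise ValueError("segment lists must have the same length")
    if not delivered:
        raise ValueError("no segments to score")
    correct = sum(d.text == r.text for d, r in zip(delivered, reviewed))
    return correct / len(delivered)


if __name__ == "__main__":
    # Placeholder sentences stand in for real transcriptions.
    reviewed = [
        Segment(0.00, 2.10, "spk001", "female", "sentence one"),
        Segment(2.35, 4.80, "spk002", "male", "sentence two"),
        Segment(5.10, 7.45, "spk001", "female", "sentence three"),
        Segment(7.90, 9.20, "spk002", "male", "sentence four"),
    ]
    delivered = list(reviewed)
    # Simulate one transcription error in the delivered data.
    delivered[1] = Segment(2.35, 4.80, "spk002", "male", "sentence too")

    rate = sentence_accuracy(delivered, reviewed)
    print(f"sentence accuracy: {rate:.0%}, meets 95% threshold: {rate >= 0.95}")
```

Exact matching is the strictest reading of the metric. A real acceptance check would more likely sample a subset of sentences for review and might normalize punctuation before comparing.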

➤ Speech data in Taiwan and Guangdong

203 People - Taiwanese Mandarin Speech Data by Mobile Phone_Guiding

The data was collected from 203 speakers from Taiwan, covering Taipei, Kaohsiung, Taichung, Tainan, etc., and includes 137 females and 66 males. It was recorded in quiet indoor environments. It can be used for speech recognition, machine translation, voiceprint recognition model training and algorithm research.

1,652 Hours – Cantonese Dialect Speech Data by Mobile Phone

The data was collected from 4,888 speakers from Guangdong Province and recorded in quiet indoor environments. The recorded content covers 500,000 commonly used spoken sentences, including high-frequency words in weico and everyday expressions. The average number of repetitions is 1.5 and the average sentence length is 12.5 words. The recording devices are mainstream Android phones and iPhones.

Data quality plays a vital role in the development of artificial intelligence. As AI technology continues to develop, the collection, cleaning, and annotation of datasets will become more complex and more crucial. By continuously improving data quality and enriching data resources, AI systems will be better able to meet a wide range of needs.
