Conversational Speech Data

From: Nexdata  Date: 2024-08-15

➤ Speech recognition in natural scenarios

AI technology now reaches many fields, from smart security to autonomous driving, and every achievement depends on strong data support. As a core element of AI algorithms, datasets are not only the basis for model training but also a key factor in improving model performance. By continuously collecting and labeling varied datasets, developers can build smarter, more efficient systems.

As speech recognition technology is deployed in more natural scenarios such as smart customer service and smart meetings, models trained on read-aloud speech data no longer perform satisfactorily.

In daily life, speakers pronounce words more naturally, so their speech contains a lot of linked words, swallowed sounds, distorted pronunciation, and unclear articulation. Speakers rarely control their voice or pronunciation deliberately, and several people often talk at the same time. Complex phenomena such as interrupted sentences, rushed words, and overlapping speech may also occur, so recognition accuracy on this natural dialogue style is often poor.

➤ Natural dialogue speech data

Data is the foundation of artificial intelligence. For AI models to reach higher accuracy, they need training datasets that closely match the application scenario. As a result, natural dialogue speech data is in urgent demand across the industry.

Nexdata has nearly 40,000 hours of natural dialogue speech data, covering Mandarin Chinese, dialects, English, Japanese, Korean, Hindi, Vietnamese, Arabic, Spanish, French, German, Italian, etc. The speakers come from different regions and cities, with balanced age and gender coverage. All audio has undergone strict manual transcription and quality inspection, with annotations for the text content, the start and end times of valid sentences, the identity of the speaker, etc. The sentence accuracy rate is as high as 95%.
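To make the annotation fields above concrete, here is a minimal Python sketch of one annotated sentence and a sentence accuracy check. The `Segment` class, its field names, and the exact-match scoring rule are assumptions for illustration; the article does not specify Nexdata's delivery format or how its accuracy rate is computed.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated sentence (hypothetical field names)."""
    speaker_id: str  # identity of the speaker
    start_s: float   # start time of the valid sentence, in seconds
    end_s: float     # end time of the valid sentence, in seconds
    text: str        # manual transcription


def sentence_accuracy(transcribed: list[str], verified: list[str]) -> float:
    """Fraction of sentences whose transcription exactly matches the verified text.

    Exact matching is an assumed scoring rule, not a documented one.
    """
    if len(transcribed) != len(verified):
        raise ValueError("both lists must have the same number of sentences")
    if not verified:
        raise ValueError("no sentences to score")
    correct = sum(t.strip() == v.strip() for t, v in zip(transcribed, verified))
    return correct / len(verified)


# Two made-up segments; the second starts before the first ends,
# which is how overlapping speech would appear in the timestamps.
segments = [
    Segment("spk_01", 0.00, 2.35, "did you watch the game last night"),
    Segment("spk_02", 2.10, 4.80, "yeah it was great"),
]
verified = ["did you watch the game last night", "yeah it was great"]
accuracy = sentence_accuracy([s.text for s in segments], verified)
```

Under this reading, a 95% sentence accuracy rate means at most 5 in every 100 sentences differ from the quality-checked text.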

1,136 Hours – American English Conversational Speech Data by Mobile Phone

The 1,136-hour American English dataset of natural conversations, collected by mobile phone, involved more than 1,000 native English speakers in the United States, with a balanced gender ratio and geographical distribution. Speakers chose a few familiar topics from a given list and then started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
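The stated audio format (16 kHz, 16-bit, uncompressed WAV) can be checked before training. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` is hypothetical, and the channel count is not checked because the article does not state it.

```python
import wave


def check_wav_format(path: str, rate: int = 16000, sample_width_bytes: int = 2) -> bool:
    """Return True if the file is uncompressed PCM WAV at the expected rate and bit depth."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate                    # 16 kHz sampling rate
            and wav.getsampwidth() == sample_width_bytes  # 2 bytes = 16-bit samples
            and wav.getcomptype() == "NONE"               # uncompressed PCM
        )
```

Note that `wave.open` raises `wave.Error` for files it cannot parse as PCM WAV, so a caller scanning a whole corpus may want to catch that exception and treat it as a failed check.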

607 Hours - Cantonese Conversational Speech Data by Mobile Phone and Voice Recorder

The 607-hour Cantonese conversational speech dataset involved 995 native speakers. Speakers chose a few familiar topics from a given list and then started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones and professional audio recorders. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with text content. The start and end time of each effective sentence, speaker identification, and other attributes are also annotated. The sentence accuracy rate is ≥ 95%.

500 Hours - Korean Conversational Speech Data by Mobile Phone

The 500-hour Korean conversational speech dataset, collected by mobile phone, involved more than 700 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and then started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.

➤ Italian & Russian speech data

500 Hours - Italian Conversational Speech Data by Mobile Phone

The 500-hour Italian conversational speech dataset involved more than 700 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and then started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification. The word accuracy rate is ≥ 98%.
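Unlike the datasets above, this one reports accuracy per word rather than per sentence. The article does not define the metric; one common reading, assumed here, is 1 minus the word error rate, where errors are counted as the word-level edit distance between a verified reference and the delivered transcription. The function names below are hypothetical.

```python
def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (word edit distance / number of reference words)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    errors = word_edit_distance(ref_words, hypothesis.split())
    return 1.0 - errors / len(ref_words)
```

With this definition, a transcription with one wrong word out of four scores 0.75, and heavy insertion errors can push the value below zero, so it is best read as an error-based score rather than a strict percentage.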

100 Hours - Russian Conversational Speech Data by Mobile Phone

The 100-hour Russian conversational speech dataset involved more than 130 native speakers, with a balanced gender ratio. Speakers chose a few familiar topics from a given list and then started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification.

Facing growing demand for data, companies and researchers need to keep exploring new data collection and annotation methods. Only by continuously improving data quality can AI technology keep up with fast-changing market demands. As data-driven intelligence accelerates, we have reason to look forward to a more efficient, intelligent, and secure future.
