Scale up Your AI Initiatives with High-quality Speech Recognition Dataset

From: Nexdata  Date: 2024-08-15

➤ Speech recognition and related support

The era of data-driven artificial intelligence has arrived. The quality of data directly affects the effectiveness and intelligence of the model. In this wave of technological change, datasets in various vertical fields are constantly emerging to meet the needs of machine learning in different scenarios. Whether it is computer vision, natural language processing or behavioral analysis, various datasets contain huge commercial value and technical potential.

According to Deloitte statistics, China's smart voice consumer and enterprise application markets are estimated to exceed 70 billion yuan and 100 billion yuan respectively by 2030. Globally, the intelligent voice industry was projected to reach US$35.12 billion in 2022, maintaining a high growth rate of 33.1%.

Put simply, speech recognition is the technology that lets a machine or program accept spoken input, interpret the meaning of sounds, and understand and execute spoken instructions. As smart terminals become more widespread, more and more personalized human-computer interfaces are designed with dialogue as the main form of interaction. A complete dialogue interaction is a closed loop of three stages: input, analysis, and output. Speech recognition is the starting point of that loop and the basis for efficient, accurate human-computer interaction.

Speech recognition decoding relies on two components, an acoustic model and a language model, each of which must be modeled and trained. Training requires huge amounts of data and computation. As a result, cloud computing, which can provide storage for massive speech recognition datasets together with high-performance computing, has become an application hotspot in the speech recognition industry.

As a data service provider with 12+ years of data processing experience, Nexdata provides multi-scenario, multi-type speech recognition datasets. Nexdata has accumulated 200,000 hours of finished speech recognition datasets covering multiple languages, channels, environments, and types. These datasets can help customers quickly optimize speech recognition models.

➤ Speech recognition datasets by phone

831 Hours - British English Speech Recognition Dataset by Mobile Phone

The 831 Hours British English Speech Recognition Dataset was recorded by 1,651 native British speakers. The recorded content covers categories such as generic, interactive, in-car, and smart home. All recordings were made in quiet indoor environments. The texts are manually proofread to ensure a high accuracy rate.

1,441 Hours - Italian Speech Recognition Dataset by Mobile Phone

The speech recognition dataset was recorded by 3,109 native Italian speakers with authentic Italian accents. The recorded content covers a wide range of categories, including general purpose, interactive, in-car commands, and home commands. All recordings were made in quiet indoor environments. The recorded text was designed by a language expert and manually proofread with high accuracy.

1,796 Hours - German Speech Recognition Dataset by Mobile Phone

This German speech recognition dataset, captured by mobile phone, totals 1,796 hours and was recorded by 3,442 native German speakers. The recorded text was designed by linguistic experts and covers generic, interactive, in-car, home, and other categories. All recordings were made in quiet indoor environments. The text has been manually proofread with high accuracy. The data can be used for automatic speech recognition, machine translation, and voiceprint recognition.

1,044 Hours - Brazilian Portuguese Speech Recognition Dataset by Mobile Phone

The 1,044 Hours - Brazilian Portuguese Speech Recognition Dataset consists of natural conversations collected by phone from more than 2,038 native speakers, with a balanced gender ratio and geographical distribution. Speakers chose from topics designed by linguistic experts and held conversations on them. The recording devices were various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all recordings were made in quiet indoor environments. All audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
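For readers who want to confirm that delivered audio matches the stated format (16kHz, 16bit, uncompressed WAV), the following is a minimal sketch using only Python's standard `wave` module. The function names `matches_spec` and `make_silent_wav` are hypothetical helpers written for this illustration and are not part of any Nexdata tooling. The channel count is not stated in the description, so it is not checked, and the mono setting in the test helper is an assumption.

```python
import io
import wave

EXPECTED_RATE_HZ = 16_000      # "16kHz" from the dataset description
EXPECTED_SAMPLE_WIDTH = 2      # "16bit" means 2 bytes per sample


def matches_spec(wav_source) -> bool:
    """Return True if the WAV is uncompressed PCM at 16 kHz, 16-bit.

    wav_source may be a file path or a binary file-like object.
    """
    with wave.open(wav_source, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
            and wav.getcomptype() == "NONE"   # "NONE" marks uncompressed PCM
        )


def make_silent_wav(rate_hz: int, sample_width: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a short in-memory WAV of silence, for trying out the check.

    Mono is an assumption made for this helper only.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (int(rate_hz * seconds) * sample_width))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(matches_spec(make_silent_wav(16_000, 2)))  # file built to the stated spec
    print(matches_spec(make_silent_wav(8_000, 2)))   # wrong sample rate
```

In practice the same function can be pointed at a file path, for example `matches_spec("sample.wav")`, where the path is a placeholder.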

769 Hours - French Speech Recognition Dataset by Mobile Phone

The 769 Hours - French Speech Recognition Dataset was recorded by 1,623 native French speakers. The recording text was designed by linguistic experts and covers the generic, interactive, in-car, and home categories. All recordings were made in quiet indoor environments. The texts are manually proofread with high accuracy.

800 Hours - American English Speech Recognition Dataset by Mobile Phone

1,842 native American English speakers with authentic accents participated in the recording. The recording script was designed by linguists around specific scenes and covers a wide range of topics, including generic, interactive, in-car, and home. All recordings were made in quiet indoor environments. The text is manually proofread with high accuracy.

➤ Korean & Japanese Speech Datasets

516 Hours - Korean Speech Recognition Dataset by Mobile Phone

The 516 Hours - Korean Speech Recognition Dataset consists of natural conversations collected by phone from more than 1,077 native speakers, each contributing around half an hour of speech, with a balanced gender ratio and geographical distribution. The recording devices were various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all recordings were made in quiet indoor environments. All audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.

474 Hours - Japanese Speech Recognition Dataset by Mobile Phone

1,006 native Japanese speakers participated in the recording, coming from the eastern, western, and Kyushu regions, with the eastern region accounting for the largest proportion. All recordings were made in quiet indoor environments. The recording content is rich, and all texts have been manually transcribed with high accuracy.
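As a rough sanity check on the figures listed above, the sketch below computes the average recording time per speaker for each dataset and the combined hours. The hours and speaker counts are taken from the descriptions above. The dictionary layout and the name `minutes_per_speaker` are illustrative choices only. The Brazilian Portuguese and Korean speaker counts are stated as "more than" the listed number, so their averages should be read as upper bounds.

```python
# (total hours, number of speakers) as listed in the dataset descriptions.
CATALOG = {
    "British English": (831, 1651),
    "Italian": (1441, 3109),
    "German": (1796, 3442),
    "Brazilian Portuguese": (1044, 2038),  # "more than 2,038" speakers
    "French": (769, 1623),
    "American English": (800, 1842),
    "Korean": (516, 1077),                 # "more than 1,077" speakers
    "Japanese": (474, 1006),
}


def minutes_per_speaker(hours: float, speakers: int) -> float:
    """Average recording time per speaker, in minutes."""
    return hours * 60 / speakers


averages = {name: minutes_per_speaker(h, s) for name, (h, s) in CATALOG.items()}
total_hours = sum(hours for hours, _ in CATALOG.values())

if __name__ == "__main__":
    for name, minutes in averages.items():
        print(f"{name:22s} {minutes:5.1f} min/speaker")
    print(f"Combined: {total_hours} hours")
```

The Korean average comes out a little under 30 minutes, which agrees with the "around half an hour" per speaker stated in that dataset's description.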

End

If you would like more details about the speech recognition datasets or how to acquire them, please feel free to contact us: [email protected].

With the in-depth application of artificial intelligence, the value of data has become prominent. Only with the support of massive high-quality data can AI technology break through its bottlenecks and advance in a more intelligent and efficient direction. In the future, we need to continue to explore new ways of collecting and annotating data to better cope with complex business requirements and achieve intelligent innovation.
