
Upgrade Your Speech Recognition Models with Large Scale Data

From: Nexdata  Date: 2024-04-02

Market data shows that the global smart voice market grew from US$11.03 billion in 2017 to US$26.39 billion in 2021, and reached US$35.12 billion in 2022, maintaining a high growth rate of 33.1%. It was expected to reach US$39.92 billion in 2023.

In the past ten years, speech recognition technology has made great progress. Continuous-speech and speaker-independent real-time recognition systems have been successfully developed in the laboratory, and many speech recognition technologies have reached the deployment stage. However, real-world applications still face various challenges, which fall mainly into three categories: robustness, low resources, and complex scenarios.

Typical robustness problems include accents and dialects, mixed or multilingual speech, and domain adaptation. Low-resource scenarios are those where deployment resources are limited and annotated data is scarce: the former is typified by the model-size and compute constraints of edge devices in AIoT scenarios, while the lack of training data is a key factor limiting the development of speech recognition in various vertical domains and languages.

To address the shortage of speech recognition data, Nexdata has designed and developed 200,000 hours of speech recognition datasets covering more than 60 languages and dialects, including Mandarin Chinese, English, Japanese, Korean, Hindi, Vietnamese, Arabic, Spanish, French, German, Italian, and Portuguese.

344 People - American English Speech Data by Mobile Phone_Guiding

The dataset contains speech data from 344 American English speakers, all of whom are locals of the United States, each reading 50 sentences. The valid data totals 9.7 hours, recorded in a quiet environment. The content covers in-car, smart-home, and voice-assistant scenarios.

520 Hours - French Speaking English Speech Data by Mobile Phone

1,089 French native speakers participated in the recording, with authentic accents. The recording script was designed by linguists and covers a wide range of topics, including generic, interactive, in-car, and home scenarios. The text was manually proofread with high accuracy. The recordings match mainstream Android and Apple phones. The dataset can be applied to automatic speech recognition and machine translation.

211 Hours - German Speech Data by Mobile Phone_Reading

The dataset contains speech data from 327 German native speakers. The recording content includes economics, entertainment, news, colloquial speech, figures, letters, etc. Each sentence contains 10.3 words on average and is repeated 1.4 times on average. All texts were manually transcribed to ensure high accuracy.

347 Hours-Italian Speech Data Collected by Mobile Phone

Italian audio data captured by mobile phone, with a total duration of 347 hours, recorded by 800 Italian native speakers with a balanced gender ratio. The recording environment is quiet, and all texts were manually transcribed with high accuracy. This dataset can be applied to automatic speech recognition, machine translation, and voiceprint recognition.

1,044 Hours - Brazilian Portuguese Speech Data by Mobile Phone

This dataset of natural phone conversations involved more than 2,038 native Brazilian Portuguese speakers, with a proper balance of gender ratio and geographical distribution. Speakers chose from topics designed by linguistic experts and conducted conversations on them. The recording devices were various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed, including the text content, the start and end time of each valid sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
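Format specifications like the 16 kHz, 16-bit uncompressed WAV above can be verified programmatically before training. A minimal sketch using Python's standard-library `wave` module (the file name `sample.wav` and the helper `check_asr_wav` are hypothetical, not part of any Nexdata tooling):

```python
import wave

def check_asr_wav(path, expected_rate=16000, expected_sampwidth=2):
    """Return True if the WAV file is 16 kHz, 16-bit, uncompressed PCM."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == expected_rate
                and wf.getsampwidth() == expected_sampwidth
                and wf.getcomptype() == "NONE")

# Write a tiny silent WAV to demonstrate the check (hypothetical file).
with wave.open("sample.wav", "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(16000)   # 16 kHz
    wf.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence

print(check_asr_wav("sample.wav"))  # True
```

A check like this is cheap to run over an entire corpus and catches resampled or compressed files before they silently degrade model quality.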

759 Hours - Hindi Speech Data by Mobile Phone

The data is 759 hours long and was recorded by 1,425 Indian native speakers with authentic accents. The recording text was designed by language experts and covers general, interactive, in-car, home, and other categories. The text was manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones. The dataset can be applied to speech recognition, machine translation, and voiceprint recognition.

234 Hours-Japanese Speech Data by Mobile Phone_R

The dataset collects speech from 799 Japanese locals, recorded in quiet indoor places, on streets, and in restaurants. The recordings include 210,000 commonly used written and spoken Japanese sentences. The sentence error rate of the transcription is less than 5%. Recording devices are mainstream Android phones and iPhones.
