Advancing Low-Resource Speech Recognition through Training Data Strategies

From: Nexdata Date: 2024-08-14

➤ Challenges in Indonesian speech recognition

The quality and diversity of a dataset largely determine how capable an AI model can be. Whether the model is used for smart security, autonomous driving, or human-machine interaction, the accuracy of its training data directly affects its performance. As data collection technology develops, customized datasets of all types are being created to support the optimization of AI algorithms. Through in-depth research on these datasets, the application prospects of AI technology will broaden.

Indonesian is one of the most widely spoken languages globally, with over 270 million speakers spread across the archipelago. As technology becomes increasingly integrated into everyday life, it is crucial to enable Indonesian speakers to communicate with and command devices using their native language. However, developing a robust speech recognition system for Indonesian presents unique challenges due to its phonological complexity and rich morphological structure.

Training data is the backbone of any machine learning model, and speech recognition systems are no exception. High-quality training data plays a pivotal role in the accuracy and performance of these systems. In the case of Indonesian speech recognition, having a diverse and extensive dataset of spoken language is essential. This dataset should encompass a wide range of accents, dialects, and speaking styles to ensure the model's ability to adapt to variations in natural speech.

➤ Indonesian speech recognition data

Obtaining sufficient and accurate training data for Indonesian speech recognition is not without challenges. Firstly, the vast linguistic diversity across Indonesia means that the dataset must capture the nuances of various regional accents and linguistic variations. Secondly, privacy concerns and ethical considerations require developers to anonymize and secure the data while complying with data protection regulations.
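As one illustration of the anonymization step mentioned above, the sketch below replaces raw speaker identifiers in metadata with keyed pseudonyms. It is a minimal example, not a description of how any particular dataset was processed: the function name, the `spk_` prefix, the key, and the sample IDs are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize_speaker(speaker_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for a speaker ID using keyed HMAC-SHA256."""
    digest = hmac.new(secret_key, speaker_id.encode("utf-8"), hashlib.sha256)
    # Keep a short prefix of the hex digest so the alias stays readable.
    return "spk_" + digest.hexdigest()[:16]


# Hypothetical key and ID, for illustration only.
key = b"replace-with-a-secret-kept-outside-the-dataset"
alias = pseudonymize_speaker("participant-0042", key)
```

The same speaker always maps to the same alias, so transcripts can still be grouped by speaker. This only covers identifiers in metadata; the voice itself remains identifying, so access control still matters.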

Indonesian Speech Datasets

359 Hours-Indonesian Speech Data by Mobile Phone

Indonesian speech data (reading) was collected from 496 native Indonesian speakers and recorded in a quiet environment. The recordings are rich in content, covering multiple categories such as economics, entertainment, news, figures, letters, and oral language. Each speaker read around 400 sentences. The valid data volume is 360 hours. All texts were manually transcribed with high accuracy.

496 People – Indonesian Speech Data by Mobile Phone_Guiding

Indonesian speech data (guiding) was collected from 496 native Indonesian speakers and recorded in a quiet environment. The recordings are rich in content, covering multiple categories such as in-car scenes, smart home, and speech assistants. Each speaker read 50 sentences. The valid data volume is 10.5 hours. All texts were manually transcribed with high accuracy.
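The figures quoted for the two 496-speaker datasets allow a quick sanity check of average utterance length. The sketch below is simple arithmetic on those numbers; the function name is hypothetical, and the "around 400 sentences" figure is approximate, so the reading-set result is only a rough estimate.

```python
def avg_utterance_seconds(hours: float, speakers: int, sentences_per_speaker: int) -> float:
    """Average audio length per sentence, in seconds."""
    total_seconds = hours * 3600
    total_sentences = speakers * sentences_per_speaker
    return total_seconds / total_sentences


# Reading set: 360 hours, 496 speakers, about 400 sentences each.
reading = avg_utterance_seconds(360, 496, 400)   # roughly 6.5 seconds
# Guiding set: 10.5 hours, 496 speakers, 50 sentences each.
guiding = avg_utterance_seconds(10.5, 496, 50)   # roughly 1.5 seconds
```

The difference is consistent with the content types: read sentences on news or economics run longer than short command-style prompts for in-car or smart-home use.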

639 Hours - Indonesian Speech Data by Mobile Phone

1,285 native Indonesian speakers participated in the recording, each with an authentic accent. The recording script was designed by linguists and covers a wide range of topics, including generic, interactive, on-board, and home scenarios. The text was manually proofread with high accuracy. The recordings match mainstream Android and Apple phones. The dataset can be applied to automatic speech recognition and machine translation scenarios.

➤ Indonesian conversational speech data

108 Hours - Indonesian Conversational Speech Data by Mobile Phone

The 108 Hours - Indonesian conversational speech data collected by mobile phone involves 140 native speakers with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with the text content, the start and end time of each valid sentence, and speaker identification.
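Before training on audio described as 16 kHz, 16-bit, uncompressed WAV, it can help to confirm that each file actually matches that format. The sketch below uses Python's standard `wave` module; the function name `matches_spec` and the demo file are hypothetical, and the demo writes one second of silence only so the example runs on its own.

```python
import os
import tempfile
import wave


def matches_spec(path: str, rate: int = 16000, sample_bytes: int = 2) -> bool:
    """Check sample rate, sample width, and that the audio is uncompressed PCM."""
    try:
        with wave.open(path, "rb") as wav:
            return (wav.getframerate() == rate
                    and wav.getsampwidth() == sample_bytes
                    and wav.getcomptype() == "NONE")
    except wave.Error:
        # The wave module rejects files it cannot parse as PCM WAV.
        return False


# Demo: write one second of 16 kHz, 16-bit mono silence, then validate it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

ok = matches_spec(demo_path)
```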

89 Hours - Indonesian Conversational Speech Data by Telephone

The 89 Hours - Indonesian conversational speech data collected by telephone involves 124 native speakers with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations, which keeps the dialogues fluent and natural. The recording devices are various mobile phones. The audio format is 8 kHz, 8-bit, u-law (μ-law) PCM, and all speech was recorded in quiet indoor environments. All audio was manually transcribed with the text content, the start and end time of each valid sentence, and speaker identification.
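Telephone audio stored as 8-bit u-law is usually expanded to 16-bit linear PCM before feature extraction. The sketch below shows that expansion, assuming the bytes follow the standard G.711 μ-law encoding and that the input is a headerless payload; both are assumptions, since the dataset description only says "u-law". The function names are hypothetical.

```python
import struct


def ulaw_byte_to_pcm16(code: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    code = ~code & 0xFF              # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    # 0x84 is the bias added during encoding; remove it after scaling.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> bytes:
    """Convert a mu-law byte string to little-endian 16-bit PCM bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in payload]
    return struct.pack("<%dh" % len(samples), *samples)


# 0xFF encodes silence; 0x00 and 0x80 are the negative and positive extremes.
pcm = decode_ulaw(bytes([0xFF, 0x00, 0x80]))
```

Each input byte becomes two output bytes, so the decoded audio is twice the size but keeps the 8 kHz sample rate.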

Data quality plays a vital role in the development of artificial intelligence. As AI technology continues to develop, the collection, cleaning, and annotation of datasets will become more complex and more crucial. By continuously improving data quality and enriching data resources, developers can build AI systems that meet a wider range of needs more accurately.
