Enhancing Multilingual Speech Recognition in the Automotive Industry with Data Localization

From: Nexdata Date: 08/14/2024

➤ Challenges in Indonesian speech recognition

The development of modern AI relies not only on complex algorithms and computing power, but also on massive amounts of real, accurate data. For companies and research institutes, high-quality datasets are a competitive advantage in technology innovation. As the demands on AI models' accuracy and generalization grow, specialized data collection and annotation work has become indispensable.

Indonesian is one of the most widely spoken languages globally, with over 270 million speakers spread across the archipelago. As technology becomes increasingly integrated into everyday life, it is crucial to enable Indonesian speakers to communicate with and command devices using their native language. However, developing a robust speech recognition system for Indonesian presents unique challenges due to its phonological complexity and rich morphological structure.

Training data is the backbone of any machine learning model, and speech recognition systems are no exception. High-quality training data plays a pivotal role in the accuracy and performance of these systems. In the case of Indonesian speech recognition, having a diverse and extensive dataset of spoken language is essential. This dataset should encompass a wide range of accents, dialects, and speaking styles to ensure the model's ability to adapt to variations in natural speech.

➤ Indonesian speech data challenges

Obtaining sufficient and accurate training data for Indonesian speech recognition is not without challenges. Firstly, the vast linguistic diversity across Indonesia means that the dataset must capture the nuances of various regional accents and linguistic variations. Secondly, privacy concerns and ethical considerations require developers to anonymize and secure the data while complying with data protection regulations.

➤ Indonesian Speech Datasets

359 Hours - Indonesian Speech Data by Mobile Phone

Indonesian speech data (reading) was collected from 496 native Indonesian speakers and recorded in a quiet environment. The recordings are rich in content, covering multiple categories such as economics, entertainment, news, figures, letters, and oral language. Each speaker recorded around 400 sentences. The valid data volume is 360 hours. All texts are manually transcribed with high accuracy.

496 People – Indonesian Speech Data by Mobile Phone_Guiding

Indonesian speech data (guiding) was collected from 496 native Indonesian speakers and recorded in a quiet environment. The recordings are rich in content, covering multiple categories such as in-car scenes, smart home, and speech assistants. Each speaker recorded 50 sentences. The valid data volume is 10.5 hours. All texts are manually transcribed with high accuracy.
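The speaker counts, sentence counts, and valid hours quoted for the two read-speech sets above allow a rough sanity check of the average recording length per sentence. The sketch below is only back-of-the-envelope arithmetic: it assumes every speaker recorded exactly the stated number of sentences ("around 400" is treated as 400), and the function name is a hypothetical helper, not part of any dataset tooling.

```python
def mean_utterance_seconds(total_hours, speakers, sentences_per_speaker):
    """Average seconds of audio per sentence, from the published totals."""
    total_sentences = speakers * sentences_per_speaker
    return total_hours * 3600 / total_sentences


# Figures taken from the dataset descriptions; sentence counts are approximate.
reading = mean_utterance_seconds(360, 496, 400)   # reading set
guiding = mean_utterance_seconds(10.5, 496, 50)   # guiding set

print(f"reading: {reading:.1f} s per sentence")   # reading: 6.5 s per sentence
print(f"guiding: {guiding:.1f} s per sentence")   # guiding: 1.5 s per sentence
```

The shorter average for the guiding set is consistent with brief command-style prompts, although the published descriptions do not state this directly.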

639 Hours - Indonesian Speech Data by Mobile Phone

1,285 native Indonesian speakers with authentic accents participated in the recording. The recording script was designed by linguists and covers a wide range of topics, including generic, interactive, on-board, and home scenarios. The text is manually proofread with high accuracy. The recording devices match mainstream Android and Apple phones. The dataset can be applied to automatic speech recognition and machine translation scenarios.

➤ Indonesian Conversational Speech Data

108 Hours - Indonesian Conversational Speech Data by Mobile Phone

The 108 Hours - Indonesian conversational speech data, collected by mobile phone, involved 140 native speakers with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations, which keeps the dialogues fluent and natural. The recording devices were various mobile phones. The audio format is 16 kHz, 16-bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification.
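Before training on a set like this, it can help to confirm that each file really matches the stated audio format. The following is a minimal sketch using Python's standard `wave` module; it checks only the sample rate and bit depth named in the description (16 kHz, 16 bit). The helper names and the in-memory test file are assumptions made for illustration, and the channel count is not checked because the description does not state it.

```python
import io
import wave

EXPECTED_RATE_HZ = 16_000    # 16 kHz, as stated in the dataset description
EXPECTED_SAMPLE_WIDTH = 2    # 16 bit = 2 bytes per sample


def check_wav_format(source):
    """Return a list of format mismatches; an empty list means the file matches."""
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bit")
    return problems


def make_test_wav(rate_hz, sample_width, n_frames=160):
    """Build a tiny mono WAV in memory so the check can run without real data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (n_frames * sample_width))
    buf.seek(0)
    return buf


print(check_wav_format(make_test_wav(16_000, 2)))  # matches the stated format
print(check_wav_format(make_test_wav(8_000, 1)))   # reports both mismatches
```

In practice `check_wav_format` would be called with a file path for every recording in the delivery, and any non-empty result flagged for review.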

89 Hours - Indonesian Conversational Speech Data by Telephone

The 89 Hours - Indonesian conversational speech data, collected by telephone, involved 124 native speakers with a balanced gender ratio. Speakers chose a few familiar topics from a given list and started conversations, which keeps the dialogues fluent and natural. The recording devices were various mobile phones. The audio format is 8 kHz, 8-bit, u-law PCM, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed with text content, the start and end time of each effective sentence, and speaker identification.
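Telephone audio in 8-bit u-law form usually has to be expanded to linear PCM before it is mixed with 16-bit data or fed to a feature extractor. The sketch below decodes u-law bytes in pure Python, following the standard G.711 expansion. It assumes the input is a raw stream of u-law bytes; how the files in this dataset are actually packaged is not stated in the description, and the function names are hypothetical.

```python
BIAS = 0x84  # 132, the bias G.711 u-law adds before compression


def ulaw_byte_to_pcm16(byte):
    """Decode one 8-bit u-law code into a signed linear sample."""
    inverted = ~byte & 0xFF              # u-law codes are stored bit-inverted
    sign = inverted & 0x80               # top bit marks a negative sample
    exponent = (inverted >> 4) & 0x07    # 3-bit segment number
    mantissa = inverted & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data):
    """Decode a bytes object of u-law codes into a list of linear samples."""
    return [ulaw_byte_to_pcm16(b) for b in data]


samples = decode_ulaw(bytes([0xFF, 0x80, 0x00]))
print(samples)  # [0, 32124, -32124]
```

The decoded values fit in a signed 16-bit range, so they can be written out as 16-bit PCM. Upsampling from 8 kHz to 16 kHz is a separate step and is not covered here.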

With the continuous advance of data technology, we can expect more innovative AI applications to emerge in all walks of life. As mentioned at the beginning, the importance of data in AI cannot be ignored, and high-quality data will continue to drive technological breakthroughs.
