In-Cabin Voice Interaction in Autonomous Driving

From: Nexdata Date: 2024-08-15

➤ Speech Synthesis: Technology and Applications

In data-driven intelligent algorithms, the quality and quantity of data determine the learning efficiency and decision-making precision of AI systems. Unlike traditional programming, machine learning and deep learning models rely on massive training data to “self-learn” patterns and rules. Building and maintaining datasets has therefore become a core mission in AI research and development. By continuously enriching data samples, AI models can handle more complex real-world problems, which improves the practicality and applicability of the technology.

As one of the most mature AI application technologies, intelligent voice technology is developing rapidly in smart home, smart vehicle, and smart wearable products. In 2022, the global intelligent voice industry was projected to reach 35.12 billion US dollars, maintaining a high growth rate of 33.1%.

Speech synthesis, also known as text-to-speech (TTS), is an important research direction in speech processing that aims to let machines generate natural, pleasant-sounding human speech. It can be applied on its own in different scenarios, or embedded as the final stage of a complete voice interaction solution.

A speech synthesis system is divided internally into a front-end and a back-end. The front-end is mainly responsible for linguistic analysis of the text, covering language, word segmentation, part-of-speech prediction, polyphonic word processing, prosody prediction, emotion, and so on. Once the pronunciation of the text has been predicted, this information is passed to the TTS back-end, where the acoustic system fuses it and converts the content into speech.
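The front-end/back-end split described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real TTS implementation: the `LEXICON` table, the `front_end` and `back_end` functions, and the single "break" prosody mark are all hypothetical names introduced here, and the back-end only returns a symbol sequence where a real acoustic system would produce audio.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon (word -> phoneme sequence); illustrative only.
LEXICON = {
    "turn": ["T", "ER", "N"],
    "left": ["L", "EH", "F", "T"],
    "now": ["N", "AW"],
}


@dataclass
class FrontEndOutput:
    tokens: list    # segmented words
    phonemes: list  # one phoneme list per word
    prosody: list   # one prosody mark per word


def front_end(text):
    """Toy front-end: segmentation, pronunciation lookup, prosody marks."""
    # Word segmentation: trivial for English; languages such as Chinese
    # need a dedicated segmenter.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Pronunciation prediction: dictionary lookup, with a placeholder
    # for words missing from the lexicon.
    phonemes = [LEXICON.get(tok, ["<unk>"]) for tok in tokens]
    # Prosody prediction: mark a phrase break after the final word only.
    prosody = ["none"] * len(tokens)
    if prosody:
        prosody[-1] = "break"
    return FrontEndOutput(tokens, phonemes, prosody)


def back_end(features):
    """Stand-in for the acoustic system: fuses the front-end features into
    one symbol sequence that a real model would convert to audio."""
    symbols = []
    for phones, mark in zip(features.phonemes, features.prosody):
        symbols.extend(phones)
        if mark == "break":
            symbols.append("|")
    return symbols


if __name__ == "__main__":
    print(back_end(front_end("Turn left now.")))
```

In a production system each front-end step would be a trained model or a large rule set, and the back-end would be one of the acoustic approaches the article goes on to describe.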

➤ Applications of speech synthesis

The back-end acoustic system has a long development history, from first-generation speech splicing (concatenative) synthesis, to second-generation parametric synthesis, to third-generation end-to-end synthesis. With each generation the back-end has become more intelligent, while the level of detail and the difficulty of annotating training material have gradually decreased.

Speech Synthesis Application Scenarios

Speech synthesis applications can be divided into one-way voice output and interactive use, although it is rare for either to be used entirely on its own. One-way voice output accounts for a relatively large share of applications in navigation, reading, dubbing, voice broadcast, and similar scenarios, while interactive speech synthesis is used more in intelligent customer service, intelligent robots, the pan-entertainment industry, and education.

● News & Broadcasting

Provides news broadcasting voices in a stable style, with male and female anchors and correct accents. This helps traditional news media build audio content quickly and offer users more diverse content formats.

● Story-telling

Expressive voices tell stories and read novels aloud, meeting the listening needs of “lazy people”. Teaching materials can also be synthesized into human-voice audio to support read-aloud and read-along functions in Chinese and English, so that children can enjoy high-quality educational resources at any time.

● Customer Service

Natural, friendly, and precise voice synthesis is applied in scenarios such as telephone follow-up calls, customer care, and collections. Using artificial intelligence, it helps companies quickly improve customer service efficiency and ultimately meet their call center business goals.

● Travel Navigation

Speech synthesis offers highly stable pronunciation, which copes with the varied place names and signs encountered in navigation. Voice output enhances the product experience and helps keep users safe while traveling.

Nexdata Text-to-Speech Data Solution

Based on massive TTS project implementation experience and advanced TTS technology, Nexdata provides high-quality, multi-scenario, multi-category TTS data solutions.

American English Speech Synthesis Corpus-Female

Female American English audio data, recorded by a native American English speaker with an authentic accent and a sweet voice. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. The corpus closely matches the research and development needs of speech synthesis.

➤ Speech synthesis datasets in Chinese

American English Speech Synthesis Corpus-Male

Male American English audio data, recorded by native American English speakers with authentic accents. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. The corpus closely matches the research and development needs of speech synthesis.

Japanese Synthesis Corpus-Female

10.4 hours of Japanese female speech synthesis data, recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. The corpus closely matches the research and development needs of speech synthesis.

Chinese Average Tone Speech Synthesis Corpus-Three Styles

Chinese average-tone speech synthesis data recorded by 50 native Chinese speakers. The corpus covers three styles: customer service, news, and story. The syllables, phonemes, and tones are balanced, and professional phoneticians participate in the annotation. The corpus closely matches the research and development needs of speech synthesis.

Chinese Mandarin Songs in Acapella — Female

103 Chinese Mandarin songs sung a cappella by a professional female Chinese singer with a sweet voice. Professional phoneticians participate in the annotation. The corpus closely matches the research and development needs of song synthesis.

Chinese Mandarin Synthesis Corpus-Female, Emotional

13.3 hours of emotional Chinese Mandarin female speech synthesis data, recorded by a native Chinese speaker reading emotional text. The syllables, phonemes, and tones are balanced, and professional phoneticians participate in the annotation. The corpus closely matches the research and development needs of speech synthesis.
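The corpus descriptions above refer to balanced phoneme coverage. As a rough illustration of how such a property might be checked on a recording script, the sketch below computes the fraction of a phoneme inventory that appears at least once, plus a normalized-entropy score where 1.0 means every phoneme occurs equally often. The function name `coverage_report`, the tiny inventory, and the choice of entropy as the balance measure are assumptions made for this example; they are not Nexdata's published method.

```python
import math
from collections import Counter


def coverage_report(utterances, inventory):
    """Return (coverage, balance, missing) for phoneme-transcribed utterances.

    utterances: iterable of phoneme lists, one list per utterance
    inventory:  set of phonemes the corpus is expected to cover
    """
    # Count only phonemes that belong to the target inventory.
    counts = Counter(p for utt in utterances for p in utt if p in inventory)
    total = sum(counts.values())

    # Coverage: share of the inventory seen at least once.
    covered = sum(1 for p in inventory if counts[p] > 0)
    coverage = covered / len(inventory) if inventory else 0.0

    # Balance: Shannon entropy of the phoneme distribution, normalized by
    # the maximum possible entropy (uniform use of the whole inventory).
    if total == 0 or len(inventory) < 2:
        balance = 0.0
    else:
        entropy = -sum((c / total) * math.log(c / total)
                       for c in counts.values() if c > 0)
        balance = entropy / math.log(len(inventory))

    missing = sorted(p for p in inventory if counts[p] == 0)
    return coverage, balance, missing


if __name__ == "__main__":
    # Hypothetical four-phoneme inventory and two short utterances.
    inventory = {"AA", "B", "K", "T"}
    script = [["K", "AA", "T"], ["B", "AA", "T"]]
    print(coverage_report(script, inventory))
```

A script designer could run a check like this while selecting sentences, adding text that contains the phonemes reported as missing until the coverage and balance figures are acceptable.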

End

If you would like more details about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai.

In the development of artificial intelligence, datasets are irreplaceable. For AI models to better understand and predict human behavior, ensuring the integrity and diversity of data must be a primary mission. By promoting data sharing and data standardization, companies and research institutions can together accelerate the maturity and adoption of AI technologies.
