
What is Emotional Speech Synthesis?

From: Nexdata  Date: 2024-04-07

Speech synthesis is a core component of voice interaction, and the technology behind it is developing rapidly. In recent years, interest in and demand for emotion synthesis have grown steadily. Emotional speech synthesis allows a machine to communicate with us like a real person: it can express different emotions, such as anger, happiness, and sadness, and even render the same emotion at different intensities.

Emotional speech conversion technology converts speech from one emotional state to another while keeping the speaker's identity and the linguistic content unchanged. Simply put, it transfers the emotional expression of an expressive source speaker to a target speaker while preserving the target speaker's timbre.

Emotional Speech Synthesis Technology

Emotional speech synthesis systems can be built on speaker and emotion embeddings. One approach uses emotion as a label: an emotion label is added on top of the original network, and the network learns the information carried by each emotion during training.
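Below is a minimal sketch of this label-conditioning idea in Python with PyTorch. It is not from the article: the emotion set, vocabulary size, and layer dimensions are all illustrative assumptions. A learned emotion embedding is added to every step of the text encoder's hidden states, so downstream layers can model emotion-dependent prosody.

```python
import torch
import torch.nn as nn

# Hypothetical emotion inventory; the labels and IDs are assumptions.
EMOTIONS = {"neutral": 0, "happy": 1, "angry": 2, "sad": 3}

class EmotionConditionedEncoder(nn.Module):
    """Text encoder conditioned on an emotion label (illustrative sizes)."""
    def __init__(self, vocab_size=256, hidden=256, n_emotions=len(EMOTIONS)):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, hidden)
        self.emotion_emb = nn.Embedding(n_emotions, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, tokens, emotion_id):
        h = self.text_emb(tokens)                      # (B, T, H) token states
        e = self.emotion_emb(emotion_id).unsqueeze(1)  # (B, 1, H) emotion code
        out, _ = self.rnn(h + e)                       # condition every step
        return out

enc = EmotionConditionedEncoder()
tokens = torch.randint(0, 256, (2, 50))                # dummy phoneme IDs
emo = torch.tensor([EMOTIONS["happy"], EMOTIONS["sad"]])
states = enc(tokens, emo)                              # (2, 50, 256)
```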

Speaker embedding obtains a speaker vector through a neural network, which requires a multi-speaker database of a certain scale for training.

Emotion embedding combines emotional data with the speaker vectors to build an emotional speech synthesis model, so high-quality, multi-emotion data is required.
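As a sketch of how a speaker vector and an emotion embedding might be fused into one conditioning signal (again in PyTorch; the 192-dimensional speaker vector, the emotion count, and the projection size are assumptions, not details from the article):

```python
import torch
import torch.nn as nn

class SpeakerEmotionConditioner(nn.Module):
    """Fuse a speaker vector with a learned emotion embedding.

    Sketch of the conditioning path only: the speaker vector (e.g. from a
    pretrained speaker encoder) and an emotion embedding are concatenated
    and projected to the decoder's hidden size. Dimensions are assumed.
    """
    def __init__(self, spk_dim=192, n_emotions=4, emo_dim=64, hidden=256):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)
        self.proj = nn.Linear(spk_dim + emo_dim, hidden)

    def forward(self, speaker_vec, emotion_id):
        e = self.emotion_emb(emotion_id)            # (B, emo_dim)
        cond = torch.cat([speaker_vec, e], dim=-1)  # (B, spk_dim + emo_dim)
        return self.proj(cond)                      # (B, hidden)

cond = SpeakerEmotionConditioner()
spk = torch.randn(2, 192)          # placeholder speaker vectors
emo = torch.tensor([1, 3])         # emotion IDs
c = cond(spk, emo)                 # (2, 256) conditioning vector
```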

For example, cross-speaker emotion transfer can use emotion and timbre perturbation to learn speaker-related and emotion-related spectral features separately, providing explicit emotion features for the final speech generation. The speaker-related branch maintains the timbre of the target speaker, while the emotion-related branch captures the emotional expression of the source speaker. Joint training therefore needs data from multiple speakers with multiple emotions as well as from multiple speakers without emotion.
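The sketch below shows only the disentanglement skeleton of such a scheme: one encoder reads the target speaker's neutral audio to capture timbre, another reads the source speaker's expressive audio to capture emotion, and a decoder consumes both codes. The perturbation mechanism and a full spectrogram decoder are omitted, and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class CrossSpeakerEmotionTransfer(nn.Module):
    """Separate timbre and emotion codes for cross-speaker transfer (sketch)."""
    def __init__(self, feat=80, hidden=128):
        super().__init__()
        self.speaker_enc = nn.GRU(feat, hidden, batch_first=True)  # timbre
        self.emotion_enc = nn.GRU(feat, hidden, batch_first=True)  # emotion
        self.decoder = nn.Linear(hidden * 2, feat)  # stand-in for a decoder

    def forward(self, target_mel, source_mel):
        _, spk = self.speaker_enc(target_mel)   # final state: timbre code
        _, emo = self.emotion_enc(source_mel)   # final state: emotion code
        cond = torch.cat([spk[-1], emo[-1]], dim=-1)
        return self.decoder(cond)               # (B, feat) frame sketch

model = CrossSpeakerEmotionTransfer()
tgt = torch.randn(2, 120, 80)   # target speaker, neutral mel frames
src = torch.randn(2, 120, 80)   # source speaker, emotional mel frames
out = model(tgt, src)           # (2, 80)
```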

Application Scenarios of Emotional Speech Synthesis

Avatars: it gives virtual characters a degree of emotional expressiveness.

Short video dubbing: it can voice short-video content to make it livelier and more engaging.

Game characters: it gives players a more immersive experience in the game.

Film and television animation: it can deliver vivid narration and dubbing.

Intelligent customer service: it improves the human-computer interaction experience and makes interactions more engaging.

Nexdata Emotional Speech Synthesis Data Solution

As the world’s leading artificial intelligence data service provider, Nexdata can provide customers with rich emotional voice data. Models trained on these data can synthesize speech that is richer in emotion and expression, making the synthesized voice more natural and realistic.

13.3 Hours — Chinese Mandarin Synthesis Corpus-Female, Emotional. Recorded by a native Chinese speaker reading emotional text, with balanced syllables, phonemes, and tones. A professional phonetician participates in the annotation. It precisely matches the research and development needs of speech synthesis.

Male Audio Data of American English. Recorded by native American English speakers with an authentic accent and balanced phoneme coverage. A professional phonetician participates in the annotation. It precisely matches the research and development needs of speech synthesis.

10.4 Hours — Japanese Synthesis Corpus-Female. Recorded by a native Japanese speaker with an authentic accent and balanced phoneme coverage. A professional phonetician participates in the annotation. It precisely matches the research and development needs of speech synthesis.

Female Audio Data of American English. Recorded by a native American English speaker with an authentic accent and a sweet voice, with balanced phoneme coverage. A professional phonetician participates in the annotation. It precisely matches the research and development needs of speech synthesis.

50 People — Chinese Average Tone Speech Synthesis Corpus-Three Styles. Recorded by native Chinese speakers; the corpus covers customer service, news, and storytelling styles. The syllables, phonemes, and tones are balanced. A professional phonetician participates in the annotation. It precisely matches the research and development needs of speech synthesis.

End

If you want to know more details about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai.
