How Text-to-Speech Training Data Enable AIGC Models

From: Nexdata Date: 2024-04-07

In 2022, AIGC (AI-generated content) swept the world. Multiple AI fields developed rapidly; painting, music, news writing, broadcasting, and many other industries were being redefined; and the metaverse grew increasingly popular.

On December 16, 2022, Science magazine released its top ten scientific breakthroughs of 2022, and AIGC was on the list. Gartner predicts that AIGC will account for 10% of all data generated by 2025. According to the analysis "Generative AI: A Creative New World", AIGC has the potential to generate trillions of dollars in economic value.

Riding the AIGC wave, content generation based on speech synthesis technology has also developed rapidly. Speech synthesis has moved from the concatenative synthesis used by Hawking, to parametric synthesis, and now to neural network-based approaches. TTS is advancing quickly around the world. WaveNet, the speech synthesizer created by DeepMind, greatly improved the sound quality of synthesized speech.

Audio generation can be divided into text-to-speech (TTS) scenarios and music generation. TTS covers voice customer service, audiobook production, intelligent dubbing, and other applications. Music generation covers producing a specific composition from an opening melody, pictures, text descriptions, music genres, emotional styles, and the like.

Microsoft has upgraded its intelligent speech technology again, adding support for multiple dialects in speech synthesis. It recently announced speech synthesis support for two Chinese dialects, Wu and Cantonese, as well as for Southwest Mandarin, Northeast Mandarin, Jilu Mandarin, and Central Plains Mandarin (including Henan and Shaanxi).

Large volumes of high-quality training data are the foundation of speech synthesis technology. Drawing on extensive experience annotating voice and text data and on leading speech synthesis technology, Nexdata provides customers with multi-timbre, multilingual, high-quality training data.

Japanese Synthesis Corpus-Female

This corpus contains 10.4 hours of female Japanese speech for synthesis. It is recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. It closely matches the research and development needs of speech synthesis.
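As a rough illustration of what "balanced phoneme coverage" can mean in practice, the sketch below counts phonemes over a set of phoneme-transcribed utterances, lists any phonemes from a target inventory that never occur, and reports a simple balance score (normalized entropy, where 1.0 means every phoneme occurs equally often). The function name `coverage_report`, the toy inventory, and the choice of entropy as the balance measure are assumptions made for this example; they do not describe how Nexdata evaluates its corpora.

```python
import math
from collections import Counter


def coverage_report(utterances, inventory):
    """Summarize phoneme coverage for phoneme-transcribed utterances.

    utterances: iterable of phoneme-label lists (hypothetical input format).
    inventory:  the phonemes the corpus is expected to cover.
    """
    inv = sorted(set(inventory))
    counts = Counter(p for utt in utterances for p in utt)
    missing = [p for p in inv if counts[p] == 0]
    total = sum(counts[p] for p in inv)

    # Normalized entropy: 1.0 for a perfectly uniform distribution,
    # lower values when a few phonemes dominate.
    if total == 0 or len(inv) < 2:
        balance = 0.0
    else:
        entropy = 0.0
        for p in inv:
            if counts[p] > 0:
                share = counts[p] / total
                entropy -= share * math.log(share)
        balance = entropy / math.log(len(inv))

    return {
        "counts": {p: counts[p] for p in inv},
        "missing": missing,
        "balance": balance,
    }


if __name__ == "__main__":
    # Toy inventory and transcripts, for illustration only.
    toy_inventory = ["a", "k", "o", "s"]
    toy_utterances = [["k", "a", "s", "a"], ["k", "a"]]
    print(coverage_report(toy_utterances, toy_inventory))
```

A report like this makes gaps visible early: a phoneme listed under `missing` would need additional recording scripts before the corpus could be called balanced.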

American English Speech Synthesis Corpus-Female

Female audio data in American English. It is recorded by a native American English speaker with an authentic accent and a sweet voice. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. It closely matches the research and development needs of speech synthesis.

American English Speech Synthesis Corpus-Male

Male audio data in American English. It is recorded by native American English speakers with authentic accents. The phoneme coverage is balanced, and professional phoneticians participate in the annotation. It closely matches the research and development needs of speech synthesis.

Chinese Average Tone Speech Synthesis Corpus-Three Styles

This Chinese average-tone speech synthesis corpus is recorded by 50 native Chinese speakers in three styles: customer service, news, and story. The syllables, phonemes, and tones are balanced, and professional phoneticians participate in the annotation. It closely matches the research and development needs of speech synthesis.

Chinese Mandarin Songs in Acapella — Female

This dataset contains 103 Chinese Mandarin songs sung a cappella by a professional female Chinese singer with a sweet voice. Professional phoneticians participate in the annotation. It closely matches the research and development needs of song synthesis.

Chinese Mandarin Synthesis Corpus-Female, Emotional

This corpus contains 13.3 hours of emotional female Chinese Mandarin speech for synthesis. It is recorded by a native Chinese speaker reading emotional text, and the syllables, phonemes, and tones are balanced. Professional phoneticians participate in the annotation. It closely matches the research and development needs of speech synthesis.

In addition, Nexdata has a rich library of sample voices, strong technical capabilities, and extensive data processing experience, and it supports personalized collection services for a designated language, timbre, age, and gender. Nexdata also offers data customization services such as audio segmentation, phoneme boundary segmentation (with a segmentation accuracy of 0.01 seconds), phonetic tagging, prosody tagging, part-of-speech tagging, pitch proofreading, rhythm tagging, and musical score production to meet customers' diverse requirements.
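To make the 0.01-second figure concrete, here is a minimal sketch of how a user might sanity-check phoneme boundary annotations once they are delivered. It assumes each segment is a (phoneme, start, end) record in seconds; the `Segment` class, the `check_segments` function, and the specific checks (boundaries on a 0.01 s grid, no empty segments, no gaps or overlaps) are hypothetical choices for illustration, not a description of Nexdata's delivery format or quality process.

```python
from dataclasses import dataclass

TICKS_PER_SECOND = 100  # one tick = 0.01 s, the accuracy quoted above


@dataclass(frozen=True)
class Segment:
    phoneme: str  # phoneme label
    start: float  # start time in seconds
    end: float    # end time in seconds


def to_ticks(seconds: float) -> int:
    """Round a time in seconds to the nearest 0.01 s tick."""
    return round(seconds * TICKS_PER_SECOND)


def on_grid(seconds: float) -> bool:
    """True if the time lies on the 0.01 s grid (within float tolerance)."""
    return abs(seconds * TICKS_PER_SECOND - to_ticks(seconds)) < 1e-6


def check_segments(segments):
    """Return a list of problems found in time-ordered segments."""
    problems = []
    prev_end = None
    for seg in segments:
        start, end = to_ticks(seg.start), to_ticks(seg.end)
        if not (on_grid(seg.start) and on_grid(seg.end)):
            problems.append(f"{seg.phoneme}: boundary finer than 0.01 s")
        if end <= start:
            problems.append(f"{seg.phoneme}: empty or reversed segment")
        if prev_end is not None and start != prev_end:
            problems.append(f"{seg.phoneme}: gap or overlap with previous segment")
        prev_end = end
    return problems


if __name__ == "__main__":
    # Toy annotation, for illustration only.
    toy = [Segment("k", 0.00, 0.07), Segment("a", 0.07, 0.21)]
    print(check_segments(toy) or "no problems found")
```

Comparing boundaries as integer ticks rather than raw floats avoids false alarms from floating-point rounding when adjacent segments share a boundary.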

To learn more about the datasets or how to acquire them, please contact us at info@nexdata.ai.
