Using High-Quality TTS Data to Optimize Your AI Models

From: Nexdata Date: 08/15/2024

➤ Applications of speech synthesis

With the rapid development of AI technology, datasets have become a core factor in improving the performance of intelligent systems. The variety and accuracy of a dataset determine how well an AI model learns and performs. Training an intelligent system requires large amounts of real-world data. Collecting and labeling that data rigorously helps AI models produce accurate results in real applications, reduces the rate of misjudgment, and improves user experience and system efficiency.

Speech synthesis, also known as TTS (text-to-speech), is a technology that artificially generates human speech, converting arbitrary text into standard, fluent speech read aloud in real time. It is an indispensable part of human-machine interaction. Speech recognition technology lets computers “listen”, while speech synthesis technology lets them “speak” like a human.

➤ Personalized TTS data solutions

From map navigation, voice assistants, and news reading to smart customer service, call centers, and public broadcasts, TTS applications are everywhere in daily life.

Apart from text-to-speech, the research scope of speech synthesis technology also includes singing synthesis, whisper synthesis, dialect synthesis, animal sound synthesis, and more. Speech synthesis technology has already been applied successfully in many fields.

Unlike traditional broadcast-style TTS, personalized TTS applications are becoming increasingly popular. Drawing on extensive experience annotating speech and text data, Nexdata provides high-quality, multi-scenario, multi-category speech synthesis data solutions.

100 People — Chinese Mandarin Average Tone Speech Synthesis Corpus, General

The corpus is recorded by native Chinese speakers. It covers news, dialogue, audiobooks, poetry, advertising, news broadcasting, and entertainment, and the phonemes and tones are balanced. The word accuracy rate is not less than 99.9%, the phoneme accuracy rate is not less than 99%, and the prosodic accuracy rate is not less than 98%.
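For readers who want to check a delivered corpus against figures like these, the following is a minimal sketch in Python. It assumes an accuracy rate means the fraction of audited units (words, phonemes, prosodic labels) found to be correct; the article does not define the metric, so that definition, the function names, and the sample counts are all hypothetical and not Nexdata's method.

```python
from typing import Dict, Tuple

# Minimum rates quoted for the 100-speaker Mandarin corpus, as fractions.
THRESHOLDS: Dict[str, float] = {"word": 0.999, "phoneme": 0.99, "prosody": 0.98}


def accuracy_rate(total: int, errors: int) -> float:
    """Fraction of audited units that are correct (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= errors <= total:
        raise ValueError("errors must be between 0 and total")
    return (total - errors) / total


def meets_thresholds(counts: Dict[str, Tuple[int, int]]) -> Dict[str, bool]:
    """Compare each audited rate with its quoted minimum."""
    return {
        name: accuracy_rate(*counts[name]) >= minimum
        for name, minimum in THRESHOLDS.items()
    }


# Hypothetical audit sample: (units checked, errors found).
sample = {"word": (20000, 15), "phoneme": (60000, 500), "prosody": (8000, 200)}
report = meets_thresholds(sample)
print(report)
```

In this made-up sample the word and phoneme rates clear their minimums, while the prosody rate (97.5%) falls short of 98%.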

19.46 Hours — American English Speech Synthesis Corpus-Female

The corpus is recorded by native American English speakers, with an authentic accent and a sweet voice. The phoneme coverage is balanced. The word accuracy rate is not less than 99%, the phoneme accuracy rate is not less than 98%, and the prosodic accuracy rate is not less than 98%.

➤ Chinese Mandarin Speech Corpus

10 Hours — Chinese Mandarin Synthesis Corpus-Female, Customer Service

The corpus is recorded by native Chinese speakers, with a lively and friendly voice. The phoneme coverage is balanced. The word accuracy rate is not less than 99.8%, the phoneme accuracy rate is not less than 98%, and the syllable boundary accuracy is not less than 98%.

6.78 Hours — Chinese Mandarin Speech Synthesis Corpus-Female Imitating Children

The corpus is recorded by native Chinese speakers, with an authentic accent and a sweet voice. The phoneme coverage is balanced. The word accuracy rate is not less than 99%.

With the rapid development of speech synthesis technology, TTS-generated speech will become increasingly natural and vivid. We firmly believe that technology will keep breaking through conventional obstacles and bring more convenience to our daily lives.


If you need data services, please feel free to contact us: info@nexdata.ai

Faced with growing demand for data, companies and researchers need to keep exploring new data collection and annotation methods. Only by continuously improving data quality can AI technology cope with fast-changing market demands. As data-driven intelligence accelerates, we have reason to look forward to a more efficient, intelligent, and secure future.