Building an Intelligent Voice Assistant with High Quality Data

From: Nexdata Date: 2024-04-07

Voice assistants have now become a standard feature on smartphones. Apple’s Siri, Amazon’s Alexa, and Samsung’s Bixby are representative examples.

Speech technology is one of the areas where artificial intelligence has made the fastest breakthroughs, and the error rate of speech recognition has dropped from nearly a third in 2012 to about 3% today. This technological breakthrough allows machines to “hear” and, in a sense, “understand” human thoughts and intentions.
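Error rates like the ones cited above are commonly reported as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The article does not say which metric it means, so treating it as WER is an assumption. The function name and example sentences below are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("turn on the kitchen light",
                          "turn on a kitchen light"))
```

Because inserted words count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.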

When it comes to speech technology, many people think of voice input methods or the speech-to-text feature in WeChat, but these are only applications of automatic speech recognition (ASR). Speech technology also includes many other branches, such as voiceprint recognition, text-to-speech (TTS), voice cloning, and speech enhancement. The most promising application in the future is undoubtedly the voice assistant.

Voice assistant technology carries out the user’s command through human-machine dialogue. Specifically, the system first converts speech into text through speech recognition, then processes and understands the text through natural language processing (NLP), executes the command in the backend, and finally delivers the response through speech synthesis. Together, these steps make up the whole process of the human-machine dialogue.
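The four stages described above can be sketched as a simple pipeline. This is a minimal illustration, not a real assistant: every function and class name here is hypothetical, and each stage is a stub (the "audio" is just UTF-8 text, and keyword matching stands in for a language-understanding model).

```python
from dataclasses import dataclass


@dataclass
class Intent:
    """Structured meaning extracted from the recognized text."""
    name: str
    slots: dict


def recognize_speech(audio: bytes) -> str:
    """ASR stub: pretend the audio bytes are already UTF-8 text."""
    return audio.decode("utf-8")


def understand(text: str) -> Intent:
    """NLP stub: keyword matching stands in for a real NLU model."""
    lowered = text.lower()
    if "weather" in lowered:
        return Intent(name="get_weather", slots={"query": text})
    if "alarm" in lowered:
        return Intent(name="set_alarm", slots={"query": text})
    return Intent(name="unknown", slots={"query": text})


def execute(intent: Intent) -> str:
    """Backend stub: map an intent to a canned text response."""
    responses = {
        "get_weather": "Here is today's weather forecast.",
        "set_alarm": "Your alarm has been set.",
    }
    return responses.get(intent.name, "Sorry, I did not understand that.")


def synthesize(text: str) -> bytes:
    """TTS stub: return encoded text instead of real audio."""
    return text.encode("utf-8")


def handle_utterance(audio: bytes) -> bytes:
    """Run one dialogue turn: ASR -> NLP -> backend -> TTS."""
    text = recognize_speech(audio)
    intent = understand(text)
    reply = execute(intent)
    return synthesize(reply)


if __name__ == "__main__":
    print(handle_utterance(b"What is the weather like?").decode("utf-8"))
```

Keeping each stage behind its own function means a stub can later be swapped for a real speech recognizer, language model, or synthesizer without changing the overall flow.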

As a world-leading AI data provider, Nexdata has adhered to the corporate vision of “Empower AI with data and change the world with intelligence” for many years. To help more researchers broaden their research fields, enrich their research content, and accelerate technological iteration, Nexdata has developed a series of speech datasets for voice assistants covering multiple languages and domains, such as reading speech, natural dialogue, mixed speech, and children’s speech.

Reading Speech Data

American English Speech Data_Reading

The dataset contains speech data from 349 American English speakers, all of whom are American locals. The recording contents cover various categories such as economics, entertainment, news, and spoken language.

British English Speech Data_Reading

The dataset contains speech data from 346 British English speakers, all of whom are English locals. The recording contents cover various categories such as economics, news, entertainment, commonly used spoken language, letters, and figures.

Japanese Speech Data_Reading

The dataset was collected from 799 Japanese locals and recorded in quiet indoor places, on streets, and in restaurants. The recording contents cover various fields such as economy, entertainment, news, and spoken language.

Natural Dialogue Data

American English Natural Dialogue Speech Data

The dataset contains 1,000 hours of American English conversational speech data, recorded by 2,000 native speakers. The speakers start each conversation around a familiar topic to ensure that it flows smoothly and naturally.

French Conversational Speech Data

The dataset contains 500 hours of French conversational speech data, recorded by about 1,000 native speakers. The speakers start each conversation around a familiar topic to ensure that it flows smoothly and naturally.

German Conversational Speech Data

Nearly 300 speakers participated in the recording and communicated face to face in a natural way. They discussed a number of given topics freely, covering a wide range of fields; the speech is natural and fluent, consistent with real dialogue scenes.

Mixed Speech Data

Mixed Speech with Chinese and English Data

The data is recorded by Chinese native speakers with accents covering seven major dialect areas. The recorded text is a mixture of Chinese and English sentences, covering general scenes and human-computer interaction scenes.

Children Speech Data

Chinese Children Speech Data

The 9,780 speakers are children aged 6 to 12, with accents covering seven Chinese dialect regions. The content covers common children’s language such as stories and numbers, as well as interactions in the car, at home, and with voice assistants.

American Children Speech Data

The data is recorded by 219 American children who are native speakers. The recording texts are mainly storybooks, children’s songs, spoken expressions, etc. There are 350 sentences for each speaker. Each sentence contains 4.5 words on average and is repeated 2.1 times on average.
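As a rough sense of scale, the per-speaker figures quoted above can be multiplied out. This is only a back-of-the-envelope estimate: it assumes that the 350 sentences are the recorded utterances per speaker and that the 4.5-word average applies to every utterance. Neither assumption is confirmed by the description, and the repetition figure is left out because its exact meaning is not stated.

```python
# Figures quoted in the dataset description.
SPEAKERS = 219
SENTENCES_PER_SPEAKER = 350
AVG_WORDS_PER_SENTENCE = 4.5

# Assumption: every speaker recorded the full 350 sentences.
total_sentences = SPEAKERS * SENTENCES_PER_SPEAKER

# Assumption: the average sentence length holds across all recordings.
approx_total_words = total_sentences * AVG_WORDS_PER_SENTENCE

print(f"Estimated utterances: {total_sentences:,}")         # 76,650
print(f"Estimated spoken words: {approx_total_words:,.0f}")  # 344,925
```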

The ultimate goal of voice assistant technology is to be a real personal assistant that can complete tasks of a certain complexity and help you obtain the information you need. As the technology matures and is more widely applied, voice assistants will become a mainstream way of operating mobile devices.

End

If you need data services, please feel free to contact us: info@nexdata.ai
