Nexdata Trending Multilingual Conversational Speech Datasets

From: Nexdata Date: 08/22/2025

With the rapid development of AI technology, datasets have become a core factor in improving the performance of intelligent systems. Speech recognition technology still faces numerous challenges, including accent diversity, specialized industry terminology, and insufficient understanding of context. In particular, the robustness and accuracy of general-purpose speech recognition models decline significantly in specialized applications. Nexdata's large-scale, multilingual natural conversation datasets cover multiple domains, helping customers improve the performance of speech recognition models in various application scenarios.

Natural Conversation Datasets

Data for minor languages is scarce, and what is available is mostly public audio and video, which lacks natural expression and limits model training. Nexdata takes the needs of multilingual recognition models into account and continuously releases hundreds of natural conversation datasets in minor languages, covering over 30 countries, to help improve the performance of multilingual speech recognition models.

French Natural Conversation Speech Dataset

This dataset was recorded by over 800 native French speakers from diverse regions and cultural backgrounds, with a total duration of approximately 1,200 hours. The data is annotated with attributes such as text content, sentence timestamps, speaker identity, and gender, ensuring high accuracy.

Portuguese Conversational Speech Dataset

Covering European Portuguese and Brazilian Portuguese, this dataset contains nearly 800 hours of conversations, recorded naturally by native speakers without pre-written text. The data achieves a word accuracy of 98%.
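
For readers who want to check a figure like this on their own sample, the sketch below shows one common way to compute word accuracy: 1 minus the word error rate, where errors are counted with a word-level edit distance between a reference transcript and the transcript being checked. The source does not state how Nexdata measures word accuracy, so this definition, the function names, and the whitespace tokenization are all assumptions for illustration.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit_distance_in_words, reference_word_count).

    Assumption: words are split on whitespace; real pipelines usually
    normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level word accuracy over (reference, hypothesis) pairs."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("no reference words to score")
    return 1.0 - total_errors / total_words
```

Errors and words are summed across the whole sample before dividing, so long utterances weigh more than short ones; averaging per-utterance scores instead would give a different number.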

Italian Conversational Speech Dataset

This dataset contains 1,200 hours of conversations, recorded by native Italian speakers in a quiet, non-echoic indoor environment. It covers over 30 common topics, including food, movies, and music. The data is annotated with special labels such as non-textual noise and stable noise.

Spanish Conversational Speech Dataset

Covering Spanish as spoken in Spain and in Mexico, this dataset contains 1,600 hours of speech. The data has been validated by multiple AI companies and helps models handle real-world diversity.

Japanese Natural Conversational Speech Dataset

This dataset contains over 800 hours of data, recorded by over 800 speakers. The speakers converse freely on a range of topics, and their fluent, natural speech reflects real-world conversation scenarios. The transcripts are produced manually for high accuracy.

Domain-Specific Datasets

The industry terminology and jargon required for professional domain recognition are often less common in regular speech datasets, resulting in lower model accuracy when processing specific content. Furthermore, users across various industries may have different dialects and accents, further complicating speech recognition.

Nexdata's proprietary annotated colloquial data covers a variety of fields, including finance, healthcare, gaming, and customer service. Recorded by native speakers with industry expertise, the data covers professional terminology and jargon across various fields, achieving high accuracy.

English Financial Speech Data

The recordings were made by speakers from the UK, US, and other locations, totaling over 200 hours. The data covers both macro and micro financial content, achieving a sentence accuracy of 95%.

Korean Financial Speech Data

The total length is over 200 hours, and the data is annotated with text content, start and end times of valid sentences, speaker identification, gender, noise annotation, sensitive information annotation, entity annotation, and capitalization.
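
As a rough illustration of how annotations like these might be represented when loading such data, here is a minimal sketch of one annotated segment. The source does not describe the delivery format, so every class name, field name, and label value below is hypothetical and would need to be mapped to the actual files.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    text: str   # surface form as it appears in the transcript
    label: str  # hypothetical label set, e.g. "PERSON", "LOCATION"


@dataclass
class Segment:
    start: float      # start time of the valid sentence, in seconds
    end: float        # end time of the valid sentence, in seconds
    speaker_id: str
    gender: str
    text: str         # transcript, with capitalization preserved
    noise_tags: list[str] = field(default_factory=list)
    has_sensitive_info: bool = False
    entities: list[Entity] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return self.end - self.start


def total_hours(segments: list[Segment]) -> float:
    """Sum the duration of all segments and convert seconds to hours."""
    return sum(s.duration for s in segments) / 3600.0


def without_sensitive(segments: list[Segment]) -> list[Segment]:
    """Keep only segments not flagged as containing sensitive information."""
    return [s for s in segments if not s.has_sensitive_info]
```

Keeping the sensitive-information flag and the entity list on each segment makes it easy to filter or mask data before training without touching the audio.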

German Financial Speech Data

Content includes macro-level financial content such as the overall economy, market trends, financial policies, and exchange rate fluctuations, as well as micro-level financial content such as individual companies, stocks, bonds, and investment portfolios. Data with excessive background noise and echo that could affect speech recognition is removed.

Spanish Financial Speech Data

The recordings are from Latin American countries and Spain, totaling over 200 hours, and cover a wide range of financial terminology. The recordings are annotated with text content, speaker identification, gender, noise, sensitive information, people, locations, financial products, and other entity annotations.

Nexdata's proprietary data covers application scenarios such as retail, real estate, insurance, finance, healthcare, energy, and telecommunications. It covers over 20 popular languages, including Chinese, English, Arabic, and Portuguese. This data reflects the terminology, accents, and sentiments of customer service scenarios, and can be used in the development of speech recognition technology for intelligent customer service.

Nexdata is committed to building high-quality, accurate data that prepares our customers for a range of challenges. If you have data requirements, please contact Nexdata.ai at [email protected].
