500 Hours - Wuhan Dialect Conversation (Bilingual Annotated) Speech Data by Mobile Phone

Wuhan

Dialect

Conversation

Wuhan Dialect (China) spontaneous dialogue speech dataset recorded on smartphones, transcribed with text content, timestamps, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

This is a paid dataset for commercial use, research, and other purposes. Licensed, ready-made datasets help jump-start AI projects.

Sample

Audio
要是如果说哎反正正咱觉得出去玩啊。[N] 要是如果说哎反正现在觉得出去玩啊。
Audio
开拓眼界确实是，让人蛮心情蛮蛮愉快呀。[N] 开拓眼界确实是，让人很心情很很愉快呀。
Audio
是的你要谈那个旅行的话，正咱的话就蛮提倡周边游。[N] 是的你要谈那个旅行的话，现在的话就很提倡周边游。
Audio
乡村游，是不是啊，一日游两日游是吧，我觉得这还是蛮好。[N] 乡村游，是不是啊，一日游两日游是吧，我觉得这还是很好。
Audio
大家都出去玩哈子，看哈子，看哈子那个呢，你像正咱马上也可以看油菜花了呢。[N] 大家都出去玩一下，看一下，看一下那个呢，你像现在马上也可以看油菜花了呢。
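Each sample line above appears to pair a Wuhan-dialect transcription with a Mandarin rendering, with the `[N]` token sitting between the two. The following is a minimal sketch of how such a line could be split, assuming `[N]` can be treated as the separator; that is an inference from the samples shown here, not a documented specification, and the delivered annotation format may differ. The names `BilingualLine` and `split_bilingual` are hypothetical.

```python
from typing import NamedTuple


class BilingualLine(NamedTuple):
    dialect: str   # text before the marker (assumed: Wuhan-dialect transcription)
    mandarin: str  # text after the marker (assumed: Mandarin rendering)


def split_bilingual(line: str, marker: str = "[N]") -> BilingualLine:
    """Split one sample line at the first occurrence of the marker.

    Assumption: the marker separates the dialect text from the Mandarin text,
    as it appears to do in the sample lines on this page.
    """
    dialect, sep, mandarin = line.partition(marker)
    if not sep:
        raise ValueError(f"marker {marker!r} not found in line")
    return BilingualLine(dialect.strip(), mandarin.strip())


sample = (
    "乡村游，是不是啊，一日游两日游是吧，我觉得这还是蛮好。"
    "[N] 乡村游，是不是啊，一日游两日游是吧，我觉得这还是很好。"
)
pair = split_bilingual(sample)
print(pair.dialect)
print(pair.mandarin)
```

In this sample the two halves differ only in the dialect word 蛮, which the second half renders as 很.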

Recommended Dataset

500 Hours - Japanese Full-Duplex Multi-Channel Speech Dataset (48 kHz)

This data was collected from dialogues based on given topics and is transcribed with text content, speaker ID, gender, age, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

Full-Duplex Speech Dataset Multi-Channel Speech Dataset Japanese Speech Dataset Japanese Audio Dataset

601 Hours - Spanish(Argentina) Real-world Casual Conversation and Monologue speech dataset

Spanish (Argentina) real-world casual conversation and monologue speech dataset covering self-media, conversation, variety shows, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

Spanish Casual Conversation ASR Argentina

INTERSPEECH 2025 MLC-SLM Challenge Dataset

The INTERSPEECH 2025 MLC-SLM Challenge Dataset, curated by Nexdata, is derived from fifteen proprietary conversational speech corpora. Distinguished by exceptional annotation accuracy and operational reliability, this dataset is engineered to address critical challenges in multilingual automatic speech recognition (ASR) and long-context comprehension. It meticulously replicates real-world complexities including spontaneous interruptions and speaker overlaps across 11 languages (1500 hours total duration), thereby providing robust training resources for developing world-ready ASR systems. All data collection and processing strictly comply with international privacy regulations including GDPR, CCPA and PIPL, with rigorous protocols ensuring participant anonymity and ethical data usage throughout the lifecycle.

workshop audio dataset mlc-slm dataset ASR speech recognition data

4600 Hours - Mandarin Full-Duplex Multi-Channel Speech Dataset

The 4600 Hours Mandarin Full-Duplex Multi-Channel Speech Dataset was collected from dialogues based on given topics and is transcribed with text content, speaker ID, gender, age, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

Mandarin speech dataset multi-stream Mandarin audio data conversational Mandarin corpus Chinese voice dataset full-duplex speech dataset multi-stream speech dataset multi-channel audio dataset

581 Hours Greek Speech Dataset – Real world Casual Conversation & Monologue for ASR

The 600 Hours Greek Real-World Speech Dataset includes both casual conversations and monologues, covering self-media, conversation, live shows, variety shows, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

greek speech dataset greek ASR training data greek conversation corpus greek monologue speech greek speech recognition dataset speech-to-text greek data greek voice dataset greek transcription dataset

600 Hours Norwegian Speech Dataset – Real-world Casual Conversation & Monologue for ASR

The 600 Hours Norwegian Real-World Speech Dataset includes both casual conversations and monologues, covering self-media, live shows, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

norwegian speech dataset norwegian ASR training data norwegian conversation corpus norwegian monologue speech norwegian speech recognition dataset speech-to-text norwegian data norwegian voice dataset multilingual speech data norwegian transcription dataset

Gujarati (India) Speech Dataset (Scripted Dialogue)

This dataset contains Gujarati speech covering several domains and mirroring real-world interactions. Transcribed with text content and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

gujarati audio dataset gujarati asr dataset gujarati speech dataset gujarati tts dataset

Spanish(Mexico) Real-world Casual Conversation and Monologue speech dataset

Spanish (Mexico) real-world casual conversation and monologue speech dataset covering self-media, conversation, variety shows, and other generic domains, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.

Mexico Spanish Casual Conversation ASR
