633 Hours - Japanese Speech Dataset (Mobile Phone Recordings)

Japanese speech dataset

Japanese spontaneous dialogue dataset

Japanese ASR training data

Japanese audio dataset

This dataset contains 633 hours of Japanese spontaneous dialogues based on given topics. Each recording is transcribed with text content, timestamps, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers (around 1,000 native speakers), which helps improve model performance in real-world, complex tasks such as Automatic Speech Recognition (ASR), Text-to-Speech (TTS) systems, and NLP research. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

This is a paid dataset for commercial use, research purposes, and more. Licensed, ready-made datasets help jump-start AI projects.

Recommended Datasets

200 Hours - Malay(Malaysia) Spontaneous Dialogue Smartphone speech dataset

Malay (Malaysia) spontaneous dialogue smartphone speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The data was collected from a geographically diverse pool of 228 native speakers, which helps improve model performance in real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

Malay Conversational

230 Hours - Burmese Spontaneous Speech Data

Burmese (Myanmar) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance in real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

Burmese Colloquial Video

503 Hours - Russian(Russia) Real-world Casual Conversation and Monologue speech dataset

Russian (Russia) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance in real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

russia Spontaneous Speech Russian

396 Hours - Korean(Korea) Real-world Casual Conversation and Monologue speech dataset

Korean (Korea) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance in real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

Spontaneous Speech korean

1003 Hours - Hindi Speech Dataset (Spontaneous Conversation)

This dataset contains 1003 hours of Hindi speech audio that mirrors real-world interactions. Each utterance is transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance in real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

Hindi speech dataset Hindi ASR dataset Hindi TTS dataset Hindi audio dataset Hindi voice dataset

1900 Hours - Indonesian(Indonesia) Real-world Casual Conversation and Monologue speech dataset

Indonesian (Indonesia) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance in real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

Indonesian Casual Conversation Monologue Asr

1,013 Hours - English(Britain) Real-world Casual Conversation and Monologue speech dataset

English (Britain) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps improve model performance in real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

Spontaneous Speech british english

Korean Telephony Speech Dataset – 136 Hours of Spontaneous Calls

This Korean Telephony Speech Dataset contains 136 hours of spontaneous dialogue recorded over phone calls. Covering over 20 real-life domains including customer service, e-commerce, finance, travel, and daily conversations, the dataset features natural two-speaker conversations collected via diverse telephony channels. Each sample is transcribed and annotated with speaker ID, gender, age, and other metadata. Data was collected from 216 native Korean speakers across different regions, enhancing model generalization. Ideal for automatic speech recognition (ASR), speaker diarization, and call center conversational AI systems. All data complies with GDPR, CCPA, and PIPL for responsible and legal AI development.

Korean telephony speech dataset Korean telephone audio telephone conversation Korean call center voice dataset Korean Korean spoken dialogue corpus multilingual telephony dataset Korean voice dataset speech-to-text Korean phone call spontaneous Korean speech data
