849 Hours Arabic Speech Dataset (Saudi Arabia) for ASR Training

Arabic colloquial speech data

Arabic colloquial video

Arabic multimodal data

Arabic natural dialogue data

Saudi Arabian natural dialogue data

Saudi Arabian multimodal data

Arabic speech dataset

Arabic ASR dataset

Arabic TTS dataset

This dataset contains 849 hours of Arabic speech from Saudi Arabia that mirrors real-world interactions. The audio is transcribed with text content and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

This is a paid dataset for commercial use, research purposes, and more. Licensed, ready-made datasets help jump-start AI projects.

Recommended Datasets

200 Hours - Malay (Malaysia) Spontaneous Dialogue Smartphone Speech Dataset

Malay (Malaysia) spontaneous dialogue smartphone speech dataset, collected from dialogues based on given topics and covering 20+ domains. The audio is transcribed with text content and annotated with speaker ID, gender, age, and other attributes. The data was collected from a large, geographically diverse pool of 228 native speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

Malay Conversational

230 Hours - Burmese Spontaneous Speech Data

Burmese (Myanmar) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. The audio is transcribed with text content and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

Burmese Colloquial Video

503 Hours - Russian (Russia) Real-world Casual Conversation and Monologue Speech Dataset

Russian (Russia) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. The audio is transcribed with text content and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

Russia Spontaneous Speech Russian

396 Hours - Korean (Korea) Real-world Casual Conversation and Monologue Speech Dataset

Korean (Korea) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. The audio is transcribed with text content and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

Spontaneous Speech Korean

1003 Hours - Hindi Speech Dataset (Spontaneous Conversation)

This dataset contains 1003 hours of Hindi speech audio that mirrors real-world interactions. Each utterance is transcribed with text content and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

Hindi speech dataset Hindi ASR dataset Hindi TTS dataset Hindi audio dataset Hindi voice dataset

1900 Hours - Indonesian (Indonesia) Real-world Casual Conversation and Monologue Speech Dataset

Indonesian (Indonesia) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. The audio is transcribed with text content and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

Indonesian Casual Conversation Monologue ASR

1,013 Hours - English (Britain) Real-world Casual Conversation and Monologue Speech Dataset

English (Britain) real-world casual conversation and monologue speech dataset that mirrors real-world interactions. The audio is transcribed with text content and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, safeguarding user privacy and legal rights throughout data collection, storage, and usage. All of our datasets comply with GDPR, CCPA, and PIPL.

Spontaneous Speech British English

Korean Telephony Speech Dataset – 136 Hours of Spontaneous Calls

This Korean Telephony Speech Dataset contains 136 hours of spontaneous dialogue recorded over phone calls. It covers more than 20 real-life domains, including customer service, e-commerce, finance, travel, and daily conversations, and features natural two-speaker conversations collected via diverse telephony channels. Each sample is transcribed and annotated with speaker ID, gender, age, and other metadata. The data was collected from 216 native Korean speakers across different regions, which helps models generalize. It is suited to automatic speech recognition (ASR), speaker diarization, and call center conversational AI systems. All data complies with GDPR, CCPA, and PIPL for responsible and legal AI development.

Korean telephony speech dataset Korean telephone audio telephone conversation Korean call center voice dataset Korean Korean spoken dialogue corpus multilingual telephony dataset Korean voice dataset speech-to-text Korean phone call spontaneous Korean speech data
