High-Quality Training Datasets

Boost the performance of your AI models with our high-quality, ready-to-use training datasets.

288 Million 3D Models & Scenes Dataset for AI and Simulation

This massive 3D models and scenes dataset includes 270 million sets of 3D models and 18 million 3D scenes. The 3D models cover conventional models, interactive models, and physics-enhanced models of various objects in indoor residential environments. The 3D scenes cover indoor home decoration scenarios and commercial space environments. This dataset can be used for tasks such as 3D asset generation, virtual environment simulation, AI model training, and industrial design applications.
3D models dataset 3D scenes dataset indoor 3D environment dataset commercial 3D space dataset physics-enhanced 3D models interactive 3D models dataset 3D assets generation dataset simulation training environment dataset virtual environment 3D data large-scale 3D AI dataset

INTERSPEECH 2025 MLC-SLM Challenge Dataset

The INTERSPEECH 2025 MLC-SLM Challenge Dataset, curated by Nexdata, is derived from fifteen proprietary conversational speech corpora. Distinguished by exceptional annotation accuracy and operational reliability, this dataset is engineered to address critical challenges in multilingual automatic speech recognition (ASR) and long-context comprehension. It meticulously replicates real-world complexities including spontaneous interruptions and speaker overlaps across 11 languages (1500 hours total duration), thereby providing robust training resources for developing world-ready ASR systems. All data collection and processing strictly comply with international privacy regulations including GDPR, CCPA and PIPL, with rigorous protocols ensuring participant anonymity and ethical data usage throughout the lifecycle.
workshop audio dataset mlc-slm dataset ASR speech recognition data

3000 Hours - Mandarin Full-Duplex Spontaneous Dialogue Speech Dataset

This Mandarin full-duplex spontaneous dialogue speech dataset was collected from dialogues on given topics. Recordings are transcribed with text content, speaker ID, gender, age, and other attributes. The speakers are numerous and geographically diverse, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.
Full-Duplex Dialogue Mandarin

500,000 Images – Multilingual OCR Dataset in 21 Languages

This dataset covers 21 languages, with 20,000 to 25,000 images per language. The data includes natural scenes, document photography scenes, and electronic scenes. Its diversity spans various data types, multiple shooting angles, and multiple languages. Annotations consist of quadrilateral or polygonal bounding regions at the row (column) level, together with content transcription at the row (column) level. This dataset can be used for multilingual optical character recognition (OCR) and text detection tasks.
multilingual OCR dataset scene text recognition data document OCR dataset electronic screen OCR data OCR dataset 21 languages AI OCR training data text recognition dataset

Landmark Image Dataset – 200K Global Building Photos with Captions

This dataset contains 200,000 sets of images and bilingual captions (Chinese and English) featuring landmark buildings from over 20 countries, including the United States, United Kingdom, France, Germany, and Russia. Each set includes 1–10 images of a specific landmark, captured from different angles, distances, and time periods. The dataset covers approximately 80,000 domestic landmarks and 120,000 international ones. Types of landmarks include commercial buildings, ancient architecture, monuments, libraries, and scenic spots. Annotations include landmark country, city, location, category, and descriptive captions. This high-quality dataset is ideal for training models in landmark recognition, image classification, multilingual image captioning, and image-to-text retrieval.
landmark image dataset building recognition dataset global landmark image caption dataset bilingual image caption data Chinese-English caption dataset landmark classification dataset image-text dataset tourism landmark dataset cultural heritage image dataset image captioning for AI training

600 Hours Greek Speech Dataset – Real-world Casual Conversation & Monologue for ASR

The 600 Hours Greek Real-World Speech Dataset includes both casual conversations and monologues, covering domains such as self-media, conversation, live broadcasts, variety shows, and other generic domains, mirroring real-world interactions. Recordings are transcribed with text content, speaker ID, and other attributes. The speakers are numerous and geographically diverse, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.
greek speech dataset greek ASR training data greek conversation corpus greek monologue speech greek speech recognition dataset speech-to-text greek data greek voice dataset greek transcription dataset

600 Hours Norwegian Speech Dataset – Real-world Casual Conversation & Monologue for ASR

The 600 Hours Norwegian Real-World Speech Dataset includes both casual conversations and monologues, covering domains such as self-media, live shows, and other generic domains, mirroring real-world interactions. Recordings are transcribed with text content, speaker ID, and other attributes. The speakers are numerous and geographically diverse, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.
norwegian speech dataset norwegian ASR training data norwegian conversation corpus norwegian monologue speech norwegian speech recognition dataset speech-to-text norwegian data norwegian voice dataset multilingual speech data norwegian transcription dataset

3D Synthetic Sensor Dataset for DMS – Images, Video & Point Clouds

This 3D high-fidelity synthetic dataset simulates real-world Driver Monitoring System (DMS) environments using photorealistic 3D scene modeling. It includes multi-modal sensor outputs such as camera images, videos, and point clouds, all generated through simulation. The dataset is richly annotated with object classification, detection, and segmentation labels, as well as human pose data (head, eye, arm, and leg position/orientation), camera parameters, and temporal metadata such as illumination and weather conditions. Ideal for training and evaluating models in autonomous driving, robotics, driver monitoring, computer vision, and synthetic perception tasks.
3D synthetic data driver monitoring synthetic dataset autonomous driving synthetic data high-fidelity simulation dataset synthetic point cloud data camera simulation dataset human pose synthetic dataset synthetic lidar dataset 3D environment modeling robotics synthetic data DMS dataset

Japanese Q&A Dataset from OKWAVE – 8.4M Questions

This dataset is collected from the Japanese OKWAVE Q&A platform and includes large-scale parsed and processed text data suitable for LLM training and Japanese natural language understanding. It contains structured fields such as questions, answers, categories, timestamps, user metadata, and supplementary explanations. As of April 2025, the dataset includes 8.4 million questions with 2.3 billion words, 27 million answers totaling 7.6 billion words, 15.5 million thank-you messages (1.7 billion words), and 2.1 million supplementary replies (360 million words). Continuously updated and rich in user-generated content, this dataset is ideal for building Japanese conversational AI, ChatGPT fine-tuning, question answering systems, text summarization, and semantic parsing models. All data complies with relevant data usage and privacy regulations.
Japanese Q&A dataset OKWAVE forum data Japanese language corpus Japanese dialogue dataset ChatGPT Japanese fine-tuning user-generated content question answer dataset

Tamil Speech Dataset – 500 Hours Monologue Audio Corpus

This dataset includes 500 hours of scripted Tamil monologue speech collected using smartphones. Each sample is transcribed with text content and metadata such as speaker ID, gender, and age. The dataset features diverse speakers from various regions, making it highly representative of real-world Tamil language use and suitable for automatic speech recognition (ASR), text-to-speech (TTS), voice activity detection (VAD), and natural language processing (NLP) tasks. Validated by leading AI companies, the dataset is designed to enhance model robustness in multilingual environments and low-resource languages. All data was collected in full compliance with global privacy regulations including GDPR, CCPA, and PIPL, ensuring ethical sourcing and responsible AI development.
Tamil speech dataset Tamil audio dataset Tamil language dataset Tamil monologue dataset Tamil voice corpus Tamil ASR data scripted speech in Tamil smartphone Tamil dataset speech recognition Tamil dataset multilingual speech data

500 Hours - Lao Scripted Monologue Smartphone Speech Dataset

This Lao scripted monologue smartphone speech dataset was collected from monologues based on given scripts. Recordings are transcribed with text content and other attributes. The speakers are numerous and geographically diverse, which helps improve model performance on real-world, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use. All our datasets comply with GDPR, CCPA, and PIPL.
laotian reading

Chinese Multi-emotional Modal Particle and Natural Conversation Speech Synthesis Corpus

The Chinese Multi-emotional Modal Particle and Natural Conversation Speech Synthesis Corpus was recorded by multiple native Chinese voice actors. It includes sentences rich in modal particles that reflect everyday speaking habits, as well as free conversation data on given topics. In each conversation, every speaker's audio is stored on a separate track. Professional phoneticians have annotated information such as text content, fully meeting the precise requirements of speech synthesis research and development.
Chinese Multi-emotional Modal particle Natural Conversation Speech Synthesis TTS