Off-the-Shelf Datasets – 1,000+ Ready-to-Use Collections

1,042 Segments 6-camera Egocentric Embodied AI Dataset

This dataset contains 1,042 egocentric video segments (approximately 35 seconds each) collected across 34 locations and 6 real-world environments, including homes, offices, and retail scenarios. It was captured with a self-developed VSLAM system that achieves millimeter-level positioning, with hard-triggered multi-sensor synchronization (≤1 ms) and 60 fps RGB video. It includes multi-view videos, calibrations, point clouds, SLAM trajectories, and gesture recognition results in standard formats. Designed for embodied AI, spatial perception, and 3D reconstruction, it offers high precision, diverse scenarios, and out-of-the-box usability, making it well suited to training robust perception-action models.

embodied ai dataset robot learning dataset robotics training data egocentric dataset robot perception dataset multimodal robotics dataset SLAM dataset

Agent Trajectory Dataset for Tool-Use Training and AI Agent Evaluation

This dataset covers office-based scenarios such as in-depth searches, data analysis, and industry research, encompassing complete multi-turn reasoning trajectories and tool-calling chains. It is designed to support the analysis of agent planning capabilities, research into tool selection strategies, and quality assessment, providing a structured benchmark for agent training and evaluation.

ai agent dataset agent training dataset agent trajectory dataset llm agent dataset tool use dataset agent evaluation dataset

Japan Autonomous Driving Dataset with Multi-Sensor Annotations for ADAS & Autonomous Vehicles

This dataset contains high-precision multi-sensor autonomous driving data collected from real vehicles operating in Japan. The dataset supports perception model development, sensor fusion, 3D object detection, multi-object tracking, lane detection, HD map construction, localization, and algorithm validation. This dataset is well suited for autonomous vehicle perception and automotive AI model training.

autonomous driving dataset autonomous vehicle dataset adas dataset autonomous driving data multi sensor dataset sensor fusion dataset lidar camera dataset multimodal driving dataset

262 Hours - Japanese Children Speech Dataset for ASR and Pronunciation Training

This dataset contains approximately 262 hours of Japanese children's speech data collected from 411 speakers aged 6 to 13, including 147,668 scripted utterances with transcriptions. The speakers are categorized into lower-grade (ages 6–9, 179 speakers) and upper-grade (ages 10–13, 232 speakers) groups, with balanced gender distribution. Recordings were collected using smartphones in 16kHz/16bit mono WAV format and include both utterance transcriptions and read-aloud scripts. The dataset is applicable to tasks such as Japanese children's ASR, TTS, speaker recognition, and pronunciation assessment.

japanese children speech dataset pediatric speech dataset children speech dataset kids speech dataset children tts dataset

10,000-Hour Egocentric Video Dataset for Robotics and AI Manipulation Training

This dataset contains 10,000 hours of egocentric multimodal data collected from diverse real-world environments, including residential, retail, and office scenarios. It covers a wide range of human activities and manipulation tasks, such as meal preparation, cleaning, storage, garment care, merchandising, and object picking. Each sample includes synchronized 4K stereo video, camera calibration parameters, 76-point full-body pose annotations, and fine-grained step-by-step action sequence labels. The dataset is suitable for robot learning, manipulation policy development, and Vision-Language-Action (VLA) models.

embodied ai dataset robotics dataset robot learning dataset robot manipulation dataset vla dataset egocentric video dataset

Long Context Reasoning Dataset – Multi-Language (EN/CH/KR) Benchmark for LLM Evaluation

This dataset is designed to address the core weaknesses of today's large language models in processing long documents and performing complex reasoning. It consists of 7,500 high-quality training examples across three languages: Chinese, English, and Korean. Each instance is built around a long-text passage and includes questions that require synthesizing information across paragraphs and documents while following multi-step logical chains. The goal is to offer a thorough and rigorous evaluation framework that tests a model's ability to perceive long-range context, retrieve relevant information, construct sound reasoning paths, and trace evidence back to its source.

long context dataset long context reasoning dataset LLM long context dataset long document QA dataset multi hop reasoning dataset reasoning dataset for LLM multi step reasoning dataset

300 Hours Italian Financial Speech Dataset – Real-World Conversations with Financial Entity Annotations

This dataset contains 300 hours of Italian financial speech, featuring real-world conversations and monologue recordings enriched with financial entity annotations. It covers a wide range of professional financial terminology and mirrors real-world interactions. Recordings are transcribed and annotated with speaker ID, gender, and other attributes. The data was collected from a large, geographically diverse pool of speakers, which helps improve model performance on real, complex tasks. It has been quality-tested by multiple AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

banking speech dataset italian financial speech dataset financial entity dataset financial speech dataset italian financial ASR dataset financial conversation dataset financial speech recognition dataset

300 Hours Arabic Financial Speech Dataset – Banking & FinTech Audio for Speech Recognition

This dataset contains 300 hours of Arabic financial speech, featuring real-world conversations and monologue recordings across a wide range of financial topics and terminology. It mirrors real-world interactions. All recordings are transcribed and include rich metadata such as speaker ID, gender, and other attributes. The data was collected from diverse speakers across multiple regions, which helps improve model performance on real, complex tasks. It has been quality-tested by multiple AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

arabic financial speech dataset banking speech dataset fintech speech dataset financial NLP dataset financial conversational AI dataset

300 Hours Portuguese Financial Speech Dataset – Banking Conversations for ASR Training

This dataset covers a wide range of financial and banking terminology, reflecting authentic real-world interactions. It includes high-quality audio recordings with transcriptions, speaker IDs, gender information, and other relevant metadata. Collected from a large pool of geographically diverse speakers, the dataset helps improve model performance across complex real-world financial communication scenarios. It has been quality-tested by multiple AI companies. We strictly adhere to data protection regulations and privacy standards throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

financial speech dataset labeled financial speech dataset financial domain speech dataset portuguese speech dataset portuguese asr dataset

300 Hours French Financial Speech Dataset for ASR, Voice AI and Conversational AI Training

This French financial speech dataset encompasses a wide range of financial terminology and authentically reflects real-world interactions. It includes transcripts, speaker IDs, gender information, and other attributes. Collected from speakers with diverse geographic and personal backgrounds, the dataset helps improve model performance in complex, real-world tasks and has undergone quality validation by multiple AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

french financial speech dataset french speech dataset french speech corpus french asr dataset french voice dataset financial conversation dataset

300 Hours Spanish Financial Speech Dataset – Banking Audio for ASR, Voice AI and LLM Training

This dataset encompasses a wide range of specialized financial terminology and authentically reflects real-world interactions. It includes transcripts, speaker IDs, gender information, and other attributes. Collected from a geographically and demographically diverse group of speakers, the dataset helps improve model performance in complex, real-world tasks and has undergone quality validation by multiple AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage. All our datasets comply with GDPR, CCPA, and PIPL.

spanish speech dataset spanish financial speech dataset spanish banking dataset rag dataset for finance financial conversational dataset

300 Hours German Financial Speech Dataset – Banking & FinTech Audio with Transcriptions

german financial speech dataset german speech dataset german asr training data conversational ai dataset financial ai training data
