LLM Datasets

Instantly enhance AI model performance with high-quality off-the-shelf datasets.

Type

All: 40
Image Caption: 19
SFT Datasets: 6
Pre-training Text: 18

25K People Multi-style Video Dataset for Digital Humans

This dataset includes high-quality video data of 25,000 unique individuals, captured in a variety of styles and environmental settings. Each person ID is represented with identity-consistent video samples featuring diverse skin tones, including White, Asian, Black, and Brown, and a wide age range from youth to elderly. All videos are at least 1080p in resolution and longer than 10 seconds in duration. This dataset is ideal for training AI models in digital human creation, identity-preserving video generation, character reanimation, and virtual avatar modeling. The diversity and consistency make it highly suitable for generative AI applications. All data was collected ethically and complies with global privacy laws including GDPR, CCPA, and PIPL.
digital human dataset, person video dataset, multi-style human video, character-consistent dataset, 1080p video data, diverse face video dataset, avatar generation dataset

50,000 Image Editing Datasets – Object Removal, Addition & Modification Dataset for AI Training

50,000 sets of image editing data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover scenes such as people, animals, goods, plants, and landscapes. For annotation, the target objects in each image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
image editing dataset, image synthesis data, object removal dataset, object addition data, AI image generation dataset, virtual scene dataset, annotated image editing data, inpainting dataset, AI training data for image manipulation, generative image dataset

100,000 Instruction-Following Evaluation SFT for Chinese LLM Text Data

100,000 instruction-following evaluation SFT text samples for Chinese LLMs. Length ranges from 50 to 400 words, with no fewer than 3 constraints in each prompt. All prompts are manually written to ensure diverse coverage.
LLM Instruction-Following SFT

250K Financial QA Dataset – MCQ & Q&A in JSON Format

This dataset contains 250,000 financial domain questions designed for academic, commercial, and AI model training use. It covers subdomains including financial products, markets, behaviors, regulations, and principles. The dataset is evenly split between multiple-choice questions (MCQs) and open-ended Q&A questions, with 125,000 entries each. All questions are provided in structured JSON format, making it highly suitable for machine learning, financial language model training, intelligent tutoring systems, and exam preparation tools. It offers a valuable resource for financial knowledge acquisition, model fine-tuning, and natural language understanding in the finance sector. All data complies with global privacy standards including GDPR, CCPA, and PIPL.
financial question dataset, finance test bank, finance MCQ dataset, AI training data finance, financial literacy dataset, structured QA dataset, fintech dataset, finance exam preparation, LLM finance training data, JSON finance questions

20,846 Groups Image Caption Data of Cookbook

20,846 groups of cookbook image caption data. Each recipe group contains 4-18 images and a text description for each image. Cuisines include Chinese, Western, Korean, Japanese, and others. Descriptions are in Chinese and English. Each Chinese description is no fewer than 15 words, and each English description is no fewer than 30 words. The data can be used for recipe recommendations, culinary education, and more.
Cookbook, Image caption, AIGC

6.03 Million - Majors Questions Text Parsing And Processing Data

Majors questions text data: about 6.03 million questions in total, counting those with explanations and those without. Each question includes question type, question, answer, and explanation; some questions may have errors in question type. Majors include Party Building, Law, Engineering, Civil Service, Computer Science, Economics, Graduate Studies, Medicine, Language, Self-Study, Comprehensive, and Policy Essay Writing. Question types include Multiple Choice, Single Choice, True/False, Fill in the Blanks, Short Answer, and Essay. This dataset can be used for tasks such as LLM training, chatgpt
Majors questions, Text, LLM

120K Multimodal QA Dataset – Visual & Text Reasoning

This dataset includes 120,000 multimodal question-answer pairs across six major academic disciplines, including medicine, engineering, art, science, and more. Each QA pair combines textual and visual content—such as charts, diagrams, blueprints, and artworks—crafted to test logical reasoning, cross-modal understanding, and domain-specific knowledge. All questions have been reviewed by subject-matter experts to ensure academic quality and accuracy. Ideal for training multimodal large language models (MLLMs), visual question answering (VQA) systems, and AI applications requiring deep contextual reasoning, this dataset supports fine-tuning tasks like knowledge grounding, visual-text alignment, and decision-making. All data complies with GDPR, CCPA, and PIPL regulations, ensuring ethical use and privacy protection.
multimodal dataset, VQA dataset, multimodal QA data, reasoning dataset for AI, image-text QA dataset, domain-specific AI training data, chart reasoning dataset, LLM multimodal training data

2.4M Korean Exam Question Dataset for AI Training

This dataset contains 2.4 million structured Korean exam questions covering primary, middle, and high school subjects including Korean, Mathematics, English, Social Studies, Science, Physics, Chemistry, Biology, History, and Geography. Each record includes question type (multiple-choice, fill-in-the-blank, true/false, short answer), the question itself, standard answers, and detailed explanations. The data is professionally annotated and categorized by subject and academic level, making it ideal for training AI models in educational applications such as question answering systems, tutoring bots, academic reasoning, and subject-level knowledge enhancement. It is widely applicable for natural language processing tasks involving structured QA, exam-style NLP training, and educational content generation. All data is collected and processed in compliance with GDPR, CCPA, and PIPL standards, ensuring privacy and legal integrity throughout the lifecycle.
Korean exam dataset, education dataset, test question dataset, multiple choice QA dataset, K-12 school question data, AI training dataset for education, NLP exam data, structured Korean question dataset, school subject QA dataset

32M Science QA Dataset – Answers & Parsing for LLMs

32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, answer, solution parsing, question type, subject category, and corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.
science question dataset, STEM QA dataset, math physics chemistry biology questions, education NLP dataset, AI training data, structured question answer dataset, academic QA dataset, question parsing dataset, K-12 science dataset, university level questions dataset

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and produced by AI data experts
  • Diversity

    Collected from a variety of real scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready-to-Go, Delivered in Seconds