LLM Datasets

Instantly enhance AI model performance with high-quality off-the-shelf datasets.

Type

All (40)
Image Caption (19)
SFT Datasets (6)
Pre-training Text (18)

100,000 Instruction-Following Evaluation SFT for Chinese LLM Text Data

100,000 Instruction-Following Evaluation SFT for Chinese LLM Text Data. Each prompt is between 50 and 400 words long and contains no fewer than 3 constraints. All prompts are manually written to ensure diverse coverage.
LLM Instruction-Following SFT

250K Financial QA Dataset – MCQ & Q&A in JSON Format

This dataset contains 250,000 financial domain questions designed for academic, commercial, and AI model training use. It covers subdomains including financial products, markets, behaviors, regulations, and principles. The dataset is evenly split between multiple-choice questions (MCQs) and open-ended Q&A questions, with 125,000 entries each. All questions are provided in structured JSON format, making it highly suitable for machine learning, financial language model training, intelligent tutoring systems, and exam preparation tools. It offers a valuable resource for financial knowledge acquisition, model fine-tuning, and natural language understanding in the finance sector. All data complies with global privacy standards including GDPR, CCPA, and PIPL.
financial question dataset, finance test bank, finance MCQ dataset, AI training data finance, financial literacy dataset, structured QA dataset, fintech dataset, finance exam preparation, LLM finance training data, JSON finance questions

20,846 Groups Image Caption Data of Cookbook

20,846 Groups Image Caption Data of Cookbook. Each recipe set contains 4-18 images and a text description for each image. Cuisines include Chinese, Western, Korean, Japanese and others. Descriptions are in Chinese and English. Chinese descriptions contain no fewer than 15 words, and English descriptions no fewer than 30 words. The data can be used for recipe recommendations, culinary education and more.
Cookbook Image caption AIGC

6.03 Million - Majors Questions Text Parsing And Processing Data

Majors Questions Text Data: about 6.03 million majors questions in total, counting those with and without explanations. Each question includes question type, question, answer, and explanation; some questions may have errors in question type. Majors include Party Building, Law, Engineering, Civil Service, Computer Science, Economics, Graduate Studies, Medicine, Language, Self-Study, Comprehensive and Policy Essay Writing. Question types include Multiple Choice, Single Choice, True/False, Fill in the Blanks, Short Answer, and Essay. This dataset can be used for tasks such as LLM training, ChatGPT
Majors questions Text LLM

120K Multimodal QA Dataset – Visual & Text Reasoning

This dataset includes 120,000 multimodal question-answer pairs across six major academic disciplines, including medicine, engineering, art, science, and more. Each QA pair combines textual and visual content—such as charts, diagrams, blueprints, and artworks—crafted to test logical reasoning, cross-modal understanding, and domain-specific knowledge. All questions have been reviewed by subject-matter experts to ensure academic quality and accuracy. Ideal for training multimodal large language models (MLLMs), visual question answering (VQA) systems, and AI applications requiring deep contextual reasoning, this dataset supports fine-tuning tasks like knowledge grounding, visual-text alignment, and decision-making. All data complies with GDPR, CCPA, and PIPL regulations, ensuring ethical use and privacy protection.
multimodal dataset, VQA dataset, multimodal QA data, reasoning dataset for AI, image-text QA dataset, domain-specific AI training data, chart reasoning dataset, LLM multimodal training data

2.4M Korean Exam Question Dataset for AI Training

This dataset contains 2.4 million structured Korean exam questions covering primary, middle, and high school subjects including Korean, Mathematics, English, Social Studies, Science, Physics, Chemistry, Biology, History, and Geography. Each record includes question type (multiple-choice, fill-in-the-blank, true/false, short answer), the question itself, standard answers, and detailed explanations. The data is professionally annotated and categorized by subject and academic level, making it ideal for training AI models in educational applications such as question answering systems, tutoring bots, academic reasoning, and subject-level knowledge enhancement. It is widely applicable for natural language processing tasks involving structured QA, exam-style NLP training, and educational content generation. All data is collected and processed in compliance with GDPR, CCPA, and PIPL standards, ensuring privacy and legal integrity throughout the lifecycle.
Korean exam dataset, education dataset, test question dataset, multiple choice QA dataset, K-12 school question data, AI training dataset for education, NLP exam data, structured Korean question dataset, school subject QA dataset

32M Science QA Dataset – Answers & Parsing for LLMs

32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, answer, solution parsing, question type, subject category, and corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.
science question dataset, STEM QA dataset, math physics chemistry biology questions, education NLP dataset, AI training data, structured question answer dataset, academic QA dataset, question parsing dataset, K-12 science dataset, university level questions dataset

300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI Training

The 300 Million Pairs High-Quality Image-Caption Dataset is a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The captions are primarily in English, with a small share in Chinese, and the images offer diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.
image-caption dataset, image-text pairs, vision-language data, generative AI training dataset, multimodal AI dataset, image description data, LLM vision data, AI image-text alignment, high-quality image data

7 Million Sets - High-Quality Video Caption Dataset

7 million high-quality videos, all genuine works released by photographers around the world. 6 million of them are described in English and 1 million in Chinese. They cover a variety of categories such as people, landscapes, and animals. The resolution is above 1080p.
AIGC, English description, Chinese description, Multiple video categories, Multiple descriptions

Why Off-the-Shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and Produced by AI Data Experts
  • Diversity

    Collected from a Variety of Real Scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready to Go, Delivered in Seconds