LLM Datasets

Instantly enhance AI model performance with high quality off-the-shelf datasets.

Type: All (40), Image Caption (19), SFT Datasets (6), Pre-training Text (18)

250K Financial QA Dataset – MCQ & Q&A in JSON Format

This dataset contains 250,000 financial domain questions designed for academic, commercial, and AI model training use. It covers subdomains including financial products, markets, behaviors, regulations, and principles. The dataset is evenly split between multiple-choice questions (MCQs) and open-ended Q&A questions, with 125,000 entries each. All questions are provided in structured JSON format, making it highly suitable for machine learning, financial language model training, intelligent tutoring systems, and exam preparation tools. It offers a valuable resource for financial knowledge acquisition, model fine-tuning, and natural language understanding in the finance sector. All data complies with global privacy standards including GDPR, CCPA, and PIPL.
Keywords: financial question dataset, finance test bank, finance MCQ dataset, AI training data finance, financial literacy dataset, structured QA dataset, fintech dataset, finance exam preparation, LLM finance training data, JSON finance questions

2.4M Korean Exam Question Dataset for AI Training

This dataset contains 2.4 million structured Korean exam questions covering primary, middle, and high school subjects including Korean, Mathematics, English, Social Studies, Science, Physics, Chemistry, Biology, History, and Geography. Each record includes question type (multiple-choice, fill-in-the-blank, true/false, short answer), the question itself, standard answers, and detailed explanations. The data is professionally annotated and categorized by subject and academic level, making it ideal for training AI models in educational applications such as question answering systems, tutoring bots, academic reasoning, and subject-level knowledge enhancement. It is widely applicable for natural language processing tasks involving structured QA, exam-style NLP training, and educational content generation. All data is collected and processed in compliance with GDPR, CCPA, and PIPL standards, ensuring privacy and legal integrity throughout the lifecycle.
Keywords: Korean exam dataset, education dataset, test question dataset, multiple choice QA dataset, K-12 school question data, AI training dataset for education, NLP exam data, structured Korean question dataset, school subject QA dataset
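The record layout described for the Korean exam questions can be sketched as a small data structure. This is only an illustration: the field names, the one-JSON-object-per-line storage format, and the sample values in the usage note are all assumptions, not the vendor's published schema.

```python
import json
from collections import Counter
from dataclasses import dataclass

# The four question types named in the dataset description.
QUESTION_TYPES = {"multiple-choice", "fill-in-the-blank", "true/false", "short answer"}


@dataclass
class ExamQuestion:
    # Hypothetical field names; the real schema may differ.
    subject: str
    level: str
    question_type: str
    question: str
    answer: str
    explanation: str


def parse_record(line: str) -> ExamQuestion:
    """Parse one JSON object (assumed one per line) and reject unknown question types."""
    record = ExamQuestion(**json.loads(line))
    if record.question_type not in QUESTION_TYPES:
        raise ValueError(f"unexpected question type: {record.question_type!r}")
    return record


def count_by_type(lines) -> Counter:
    """Count records per question type across an iterable of JSON lines."""
    return Counter(parse_record(line).question_type for line in lines)
```

For example, a made-up record with `"subject": "Mathematics"`, `"level": "middle school"`, and `"question_type": "true/false"` would parse into an `ExamQuestion`, and `count_by_type` could then be used to check how the question types are distributed in a sample.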

Japanese Q&A Dataset from OKWAVE – 8.4M Questions

This dataset is collected from the Japanese OKWAVE Q&A platform and includes large-scale parsed and processed text data suitable for LLM training and Japanese natural language understanding. It contains structured fields such as questions, answers, categories, timestamps, user metadata, and supplementary explanations. As of April 2025, the dataset includes 8.4 million questions with 2.3 billion words, 27 million answers totaling 7.6 billion words, 15.5 million thank-you messages (1.7 billion words), and 2.1 million supplementary replies (360 million words). Continuously updated and rich in user-generated content, this dataset is ideal for building Japanese conversational AI, ChatGPT fine-tuning, question answering systems, text summarization, and semantic parsing models. All data complies with relevant data usage and privacy regulations.
Keywords: Japanese Q&A dataset, OKWAVE forum data, Japanese language corpus, Japanese dialogue dataset, ChatGPT Japanese fine-tuning, user-generated content, question answer dataset
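The volume figures quoted for the OKWAVE dataset imply rough average lengths per item. The short calculation below derives them; it uses only the rounded numbers from the description, so the results are approximate, and the dictionary keys are illustrative names rather than fields of the dataset.

```python
# Item counts and word totals quoted in the description (as of April 2025).
COUNTS = {
    "questions": 8_400_000,
    "answers": 27_000_000,
    "thank_you_messages": 15_500_000,
    "supplementary_replies": 2_100_000,
}
WORDS = {
    "questions": 2_300_000_000,
    "answers": 7_600_000_000,
    "thank_you_messages": 1_700_000_000,
    "supplementary_replies": 360_000_000,
}


def average_words(kind: str) -> float:
    """Approximate average number of words per item of the given kind."""
    return WORDS[kind] / COUNTS[kind]


def answers_per_question() -> float:
    """Approximate average number of answers attached to each question."""
    return COUNTS["answers"] / COUNTS["questions"]


if __name__ == "__main__":
    for kind in COUNTS:
        print(f"{kind}: about {average_words(kind):.0f} words each")
    print(f"answers per question: about {answers_per_question():.1f}")
```

On these figures, questions and answers both average a little under 300 words, and each question has roughly three answers.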

32M Science QA Dataset – Answers & Parsing for LLMs

This dataset contains 32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, answer, solution explanation, question type, subject category, and corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.
Keywords: science question dataset, STEM QA dataset, math physics chemistry biology questions, education NLP dataset, AI training data, structured question answer dataset, academic QA dataset, question parsing dataset, K-12 science dataset, university level questions dataset

1M Chinese Coding Questions Dataset – Python/Java/C++

This dataset contains 1 million Chinese programming questions with corresponding answers, detailed parses (explanations), and programming language labels. It includes a wide range of questions in C, C++, Python, Java, and JavaScript, making it ideal for training large language models (LLMs) on multilingual code understanding and generation. The questions cover fundamental to advanced topics, supporting AI applications such as code completion, bug fixing, and programming reasoning. This structured dataset enhances model performance in natural language programming tasks and helps reinforce code logic skills in AI systems. All data complies with international privacy regulations including GDPR, CCPA, and PIPL.
Keywords: Chinese coding questions dataset, programming QA data, parsed coding problems, Python Java C++ dataset, code generation LLM dataset, Chinese code questions

100K English Instruction Tuning Dataset – General Domain SFT for LLM Fine-Tuning

This dataset contains 100,000 English general-domain supervised fine-tuning (SFT) text entries, forming a high-quality corpus designed to optimize instruction-following capabilities in large language models. Each data point is double-verified by experienced linguistic professionals and AI engineers to ensure relevance, clarity, and effectiveness in improving model alignment and response precision. The dataset supports instruction tuning tasks across a wide range of general knowledge domains and is compatible with leading open-source LLMs such as LLaMA, Falcon, GPT-NeoX, and Mistral. Ideal for use in alignment, safety tuning, and instruction-based generation enhancement, this dataset offers a robust foundation for model adaptation and performance improvement. All data complies with global data usage and privacy standards.
Keywords: LLM fine-tuning dataset, supervised fine-tuning, SFT dataset, English instruction tuning data, general domain LLM data, AI model fine-tuning, instruction-following training data, GPT tuning dataset

50,000 Image Editing Datasets – Object Removal, Addition & Modification Dataset for AI Training

This dataset contains 50,000 sets of image editing data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover people, animals, goods, plants, and landscapes. For annotation, the target in each image is edited according to the editing instruction. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
Keywords: image editing dataset, image synthesis data, object removal dataset, object addition data, AI image generation dataset, virtual scene dataset, annotated image editing data, inpainting dataset, AI training data for image manipulation, generative image dataset

25K People Multi-style Video Dataset for Digital Humans

This dataset includes high-quality video data of 25,000 unique individuals, captured in a variety of styles and environmental settings. Each person ID is represented with identity-consistent video samples featuring diverse skin tones, including White, Asian, Black, and Brown, and a wide age range from youth to elderly. All videos are at least 1080p in resolution and longer than 10 seconds in duration. This dataset is ideal for training AI models in digital human creation, identity-preserving video generation, character reanimation, and virtual avatar modeling. The diversity and consistency make it highly suitable for generative AI applications. All data was collected ethically and complies with global privacy laws including GDPR, CCPA, and PIPL.
Keywords: digital human dataset, person video dataset, multi-style human video, character-consistent dataset, 1080p video data, diverse face video dataset, avatar generation dataset

100,000 Instruction-Following Evaluation SFT for Chinese LLM Text Data

This dataset contains 100,000 instruction-following evaluation SFT text samples for Chinese LLMs. Each prompt is between 50 and 400 words long and contains no fewer than 3 constraints. All prompts are manually written to ensure diverse coverage.
Keywords: LLM, Instruction-Following, SFT

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright: Clear copyright, ready to check
  • Security: Properly authorized, secure to use
  • Professional: Designed and produced by AI data experts
  • Diversity: Collected from a variety of real scenes
  • Cost Effective: More cost-efficient than tailored data
  • Efficiency: Ready to go, delivered in seconds