LLM Training Datasets – SFT, Pre-training & Caption

Please fill in your name

Mobile phone format error

Please enter the telephone

Please enter your company name

Please enter your company email

Please enter the data requirement

Successful submission! Thank you for your support.

Format error, Please fill in again

Confirm

The data requirement cannot be less than 5 words and cannot be pure numbers

m.nexdata.datatang.com

Home > All Category Datasets > LLM Datasets

Type

All

Image Caption

SFT Datasets

Pre-training Text

1.51M Instruction-Based Image Editing Dataset for Generative AI Training

This dataset contains 1.51 million annotated image editing pairs. Editing types include 500,000 sets of portrait/object consistency editing, 300,000 sets of structural edits, 210,000 sets of mixed editing, and 450,000 sets of spatial editing, and 50,000 sets of style transfer editing. The editing targets cover scenes such as people, animals, goods, plants, and landscapes. In terms of annotation, the targets that need to be edited in the image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.

generative AI image dataset image editing dataset AI image editing dataset image editing training data AI image manipulation dataset image editing pairs dataset image inpainting dataset style transfer dataset

2.4M Korean Exam Question Dataset for AI Training

This dataset contains 2.4 million structured Korean exam questions covering primary, middle, and high school subjects including Korean, Mathematics, English, Social Studies, Science, Physics, Chemistry, Biology, History, and Geography. Each record includes question type (multiple-choice, fill-in-the-blank, true/false, short answer), the question itself, standard answers, and detailed explanations. The data is professionally annotated and categorized by subject and academic level, making it ideal for training AI models in educational applications such as question answering systems, tutoring bots, academic reasoning, and subject-level knowledge enhancement. It is widely applicable for natural language processing tasks involving structured QA, exam-style NLP training, and educational content generation. All data is collected and processed in compliance with GDPR, CCPA, and PIPL standards, ensuring privacy and legal integrity throughout the lifecycle.

korean exam dataset education dataset test question dataset multiple choice QA dataset K-12 school question data AI training dataset for education NLP exam data structured Korean question dataset school subject QA dataset

50,000 Image Editing Datasets – Object Removal, Addition & Modification Dataset for AI Training

50,000 Sets - Image Editing Data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover scenes such as people, animals, goods, plants, and landscapes. In terms of annotation, based on the editing instructions, the targets that need to be edited in the image are edited. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.

image editing dataset image synthesis data object removal dataset object addition data AI image generation dataset virtual scene dataset annotated image editing data inpainting dataset AI training data for image manipulation generative image dataset

32M Science QA Dataset – Answers & Parsing for LLMs

32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, answer, solution parsing, question type, subject category, and corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.

science question dataset STEM QA dataset math physics chemistry biology questions education NLP dataset AI training data structured question answer dataset academic QA dataset question parsing dataset K-12 science dataset university level questions dataset

1M Chinese Coding Questions Dataset – Python/Java/C++

This dataset contains 1 million Chinese programming questions with corresponding answers, detailed parses (explanations), and programming language labels. It includes a wide range of questions in C, C++, Python, Java, and JavaScript, making it ideal for training large language models (LLMs) on multilingual code understanding and generation. The questions cover fundamental to advanced topics, supporting AI applications such as code completion, bug fixing, and programming reasoning. This structured dataset enhances model performance in natural language programming tasks and helps reinforce code logic skills in AI systems. All data complies with international privacy regulations including GDPR, CCPA, and PIPL.

Chinese coding questions dataset programming QA data parsed coding problems Python Java C++ dataset code generation LLM dataset Chinese code questions

100K English Instruction Tuning Dataset – General Domain SFT for LLM Fine-Tuning

100,000 Fine-Tuning Text Dataset for English LLM General Domain SFT is a high-quality supervised fine-tuning corpus designed to optimize instruction-following capabilities in large language models. Each data point is double-verified by experienced linguistic professionals and AI engineers to ensure relevance, clarity, and effectiveness in improving model alignment and response precision. The dataset supports instruction tuning tasks across a wide range of general knowledge domains and is compatible with leading open-source LLMs such as LLaMA, Falcon, GPT-NeoX, and Mistral. Ideal for use in alignment, safety tuning, and instruction-based generation enhancement, this dataset offers a robust foundation for model adaptation and performance improvement. All data complies with global data usage and privacy standards.

LLM fine-tuning dataset supervised fine-tuning SFT dataset English instruction tuning data general domain LLM data AI model fine-tuning instruction-following training data GPT tuning dataset

Agent Trajectory Dataset for Tool-Use Training and AI Agent Evaluation

This dataset covers office-based scenarios such as in-depth searches, data analysis, and industry research, encompassing complete multi-turn reasoning trajectories and tool-calling chains. It is designed to support the analysis of agent planning capabilities, research into tool selection strategies, and quality assessment, providing a structured benchmark for agent training and evaluation.

ai agent dataset agent training dataset agent trajectory dataset llm agent dataset tool use dataset agent evaluation datase

Long Context Reasoning Dataset – Multi-Language (EN/CH/KR) Benchmark for LLM Evaluation

This dataset is designed to tackle the core weaknesses of today's large language models when it comes to processing long documents and performing complex reasoning. It consists of 7,500 high-quality training examples across three languages—Chinese, English, and Korean. Each instance is built around a long-text passage and includes questions that require synthesizing information across paragraphs and documents, while following multi-step logical chains. The goal is to offer a thorough and rigorous evaluation framework that tests a model's ability to perceive long-range context, retrieve relevant information, construct sound reasoning paths, and trace evidence back to its source.

long context dataset long context reasoning dataset LLM long context dataset long document QA dataset multi hop reasoning dataset reasoning dataset for LLM multi step reasoning dataset

89K Japanese-Arabic Image-Text Dataset for Multimodal LLM Training

The dataset comprises a total of 89,007 samples, with each sample consisting of an image and a JSON document. The JSON document may contain image descriptions, visual question-answering pairs, OCR results extracted from the image, or visual question-answering pairs based on the OCR results. The dataset covers Arabic and Japanese languages and spans six major domains: ① Business and Finance, ②Coding and Computer Science, ③Law, Government, and Politics, ④Science, Technology, Engineering, and Mathematics (STEM), ⑤Society, Culture, Humanities, and Religion, ⑥ Sports, Lifestyle, and Leisure. Image classification accuracy (per-image) exceeds 95%; image-text matching accuracy is above 95%; OCR recognition accuracy (per-sentence) exceeds 95%. Suitable for multilingual OCR, multimodal Large Language Model (LLM) training, image captioning, and multilingual Visual Question Answering (VQA) tasks.

image text dataset multimodal dataset vision language dataset image caption dataset vlm training data multimodal llm dataset

Tailor Your Data Now

Why off-the-shelf Datasets

Copyright
Clear Coyright and Ready to Check
Security
Properly Authorized Secure to Use
Professional
Designed and produced by AI data experts
Diversity
Collected from a varity of real scenes
Cost Effective
More Cost-Efficient Than Tailored Data
Efficiency
Ready-To-Go Deliver in Seconds

Subscribe to our newsletter

Be the first to receive Nexdata latest product releases, data solutions and enterprise news.

Off-the-Shelf Datasets: All Category Datasets; Embodied AI Datasets; LLM Datasets; Computer Vision Datasets; Speech Recognition Datasets; Speech Synthesis Datasets; OCR Datasets; Pronunciation Dictionary; NLU Datasets

Data Service: 3D Point Cloud Data; Street View Data; OCR Data; Behavior Recognition Data; Identity Recognition Data; Speech Recognition Data; Speech Synthesis Data; Multimodal Data

Industries: Embodied AI; Generative AI; Autonomous Vehicles; AR/VR; Conversational AI; Smart Home; Retail; Intelligent Healthcare

Company: About Us; News; Partners; Quality & Security; Event
Links: OPENMPD; DataPlus; Datarade

Platform: Platform
Competition: Competition
Resources: Sponsored Datasets

Sharpen Your AI with Better Data

+1(626)594-5598

[email protected]

Sitemap Terms and Conditions

We use cookies to enhance your browsing experience, serve personalized ads or content, and analyze our traffic. By clicking "Accept All", you consent to our use of cookies.

a1a779ec-ab6d-4637-a089-7fd0a35c3d80