LLM Datasets

Instantly enhance AI model performance with high-quality off-the-shelf datasets.

Type

All (37)
Image Caption (17)
SFT Datasets (5)
Pre-training Text (17)

50,000 Image Editing Datasets – Object Removal, Addition & Modification Dataset for AI Training

50,000 sets of image editing data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover people, animals, goods, plants, and landscapes. For annotation, the targets named in each editing instruction are edited in the image. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
image editing dataset image synthesis data object removal dataset object addition data AI image generation dataset virtual scene dataset annotated image editing data inpainting dataset AI training data for image manipulation generative image dataset

120K Multimodal QA Dataset – Visual & Text Reasoning

This dataset includes 120,000 multimodal question-answer pairs across six major academic disciplines, including medicine, engineering, art, science, and more. Each QA pair combines textual and visual content—such as charts, diagrams, blueprints, and artworks—crafted to test logical reasoning, cross-modal understanding, and domain-specific knowledge. All questions have been reviewed by subject-matter experts to ensure academic quality and accuracy. Ideal for training multimodal large language models (MLLMs), visual question answering (VQA) systems, and AI applications requiring deep contextual reasoning, this dataset supports fine-tuning tasks like knowledge grounding, visual-text alignment, and decision-making. All data complies with GDPR, CCPA, and PIPL regulations, ensuring ethical use and privacy protection.
multimodal dataset VQA dataset multimodal QA data reasoning dataset for AI image-text QA dataset domain-specific AI training data chart reasoning dataset LLM multimodal training data

100,145 Sets of ICONS Image Caption Data

100,145 Sets of ICONS Image Caption Data. The data includes two major categories of icons, namely 3D Style Icons and Vector Illustration Icons, totaling 16 subcategories. In terms of annotation, the icon descriptions are in Chinese, with a description length of about 30 characters. The data can be used for tasks such as graphic recognition and interface interaction.
ICONS Image caption

100K English Instruction Tuning Dataset – General Domain SFT for LLM Fine-Tuning

100,000 Fine-Tuning Text Dataset for English LLM General Domain SFT is a high-quality supervised fine-tuning corpus designed to optimize instruction-following capabilities in large language models. Each data point is double-verified by experienced linguistic professionals and AI engineers to ensure relevance, clarity, and effectiveness in improving model alignment and response precision. The dataset supports instruction tuning tasks across a wide range of general knowledge domains and is compatible with leading open-source LLMs such as LLaMA, Falcon, GPT-NeoX, and Mistral. Ideal for use in alignment, safety tuning, and instruction-based generation enhancement, this dataset offers a robust foundation for model adaptation and performance improvement. All data complies with global data usage and privacy standards.
LLM fine-tuning dataset supervised fine-tuning SFT dataset English instruction tuning data general domain LLM data AI model fine-tuning instruction-following training data GPT tuning dataset

300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI Training

The 300 Million Image-Caption Pairs dataset is a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.
image-caption dataset image-text pairs vision-language data generative AI training dataset multimodal AI dataset image description data LLM vision data AI image-text alignment high-quality image data

2.4M Korean Exam Question Dataset for AI Training

This dataset contains 2.4 million structured Korean exam questions covering primary, middle, and high school subjects including Korean, Mathematics, English, Social Studies, Science, Physics, Chemistry, Biology, History, and Geography. Each record includes question type (multiple-choice, fill-in-the-blank, true/false, short answer), the question itself, standard answers, and detailed explanations. The data is professionally annotated and categorized by subject and academic level, making it ideal for training AI models in educational applications such as question answering systems, tutoring bots, academic reasoning, and subject-level knowledge enhancement. It is widely applicable for natural language processing tasks involving structured QA, exam-style NLP training, and educational content generation. All data is collected and processed in compliance with GDPR, CCPA, and PIPL standards, ensuring privacy and legal integrity throughout the lifecycle.
korean exam dataset education dataset test question dataset multiple choice QA dataset K-12 school question data AI training dataset for education NLP exam data structured Korean question dataset school subject QA dataset
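
As a rough illustration of the record structure described above, the following Python sketch models one exam question and groups records by subject. The class name, field names, and sample records are hypothetical; the listing names the kinds of information in each record but not the actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    # The four question types named in the dataset description.
    MULTIPLE_CHOICE = "multiple-choice"
    FILL_IN_THE_BLANK = "fill-in-the-blank"
    TRUE_FALSE = "true/false"
    SHORT_ANSWER = "short answer"


@dataclass(frozen=True)
class ExamQuestion:
    # Hypothetical field names; the real schema may differ.
    subject: str
    level: str
    question_type: QuestionType
    question: str
    answer: str
    explanation: str


def group_by_subject(records: list[ExamQuestion]) -> dict[str, list[ExamQuestion]]:
    """Collect records into lists keyed by subject."""
    groups: dict[str, list[ExamQuestion]] = {}
    for record in records:
        groups.setdefault(record.subject, []).append(record)
    return groups


# Made-up placeholder records, not taken from the dataset.
SAMPLES = [
    ExamQuestion("Mathematics", "primary", QuestionType.SHORT_ANSWER,
                 "What is 7 + 5?", "12", "Add the two numbers."),
    ExamQuestion("Mathematics", "middle", QuestionType.TRUE_FALSE,
                 "Every square is a rectangle.", "true",
                 "A square has four right angles."),
    ExamQuestion("Science", "primary", QuestionType.MULTIPLE_CHOICE,
                 "Which is a planet? (a) Moon (b) Mars", "b",
                 "Mars orbits the Sun."),
]

by_subject = group_by_subject(SAMPLES)
for subject, items in by_subject.items():
    print(subject, len(items))
```

Grouping by subject or level in this way is one simple route to building per-subject training or evaluation splits.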

7 Million Sets - High-Quality Video Caption Dataset

7 million high-quality, genuine videos, all released by photographers around the world. 6 million are described in English and 1 million in Chinese. The videos cover categories such as people, landscapes, and animals. The resolution is above 1080p.
AIGC English description Chinese description Multiple video categories Multiple descriptions

Multilingual Grammar Correction Dataset – 480K Parallel Texts (DE, ES, FR, IT)

This dataset covers four major European languages (French, German, Spanish, and Italian) and contains 480,000 pairs of original and corrected texts. Each record is in JSON format with two fields: input (raw text) and output (corrected text). The data can support natural language processing, machine translation, and language teaching research.
German French Spanish Italian proofreading Multilingual Grammar Correction Dataset Grammar Correction Dataset
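
A minimal sketch of reading records in the format described above, with the two fields input and output. It assumes one JSON object per line, which the listing does not state, and the sample sentences are made up for illustration rather than taken from the dataset.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionPair:
    input: str   # raw text, possibly containing errors
    output: str  # corrected text


def parse_pair(line: str) -> CorrectionPair:
    """Parse one JSON object and check that both fields are non-empty strings."""
    record = json.loads(line)
    for field in ("input", "output"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field!r}")
    return CorrectionPair(input=record["input"], output=record["output"])


# Made-up sample records (assumed one JSON object per line).
SAMPLE_LINES = [
    '{"input": "Ich habe gestern ins Kino gegangen.", '
    '"output": "Ich bin gestern ins Kino gegangen."}',
    '{"input": "Elle a allé au marché.", '
    '"output": "Elle est allée au marché."}',
]

pairs = [parse_pair(line) for line in SAMPLE_LINES]
changed = [p for p in pairs if p.input != p.output]
print(f"{len(changed)} of {len(pairs)} sample pairs contain a correction")
```

Validating both fields up front means malformed records fail early instead of surfacing later during training.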

288 Million 3D Models & Scenes Dataset for AI and Simulation

Massive 3D Models & Scenes Dataset includes 270 million sets of 3D models and 18 million 3D scenes. 3D models cover conventional models, interactive models, and physics-enhanced models with various objects in indoor residential environments. 3D scenes cover indoor home decoration scenarios and commercial space environments. This dataset can be used for tasks like 3D asset generation, virtual environment simulation, AI model training, and industrial design applications.
3D models dataset 3D scenes dataset indoor 3D environment dataset commercial 3D space dataset physics-enhanced 3D models interactive 3D models dataset 3D assets generation dataset simulation training environment dataset virtual environment 3D data large-scale 3D AI dataset

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright: clear copyright, ready to check
  • Security: properly authorized and secure to use
  • Professional: designed and produced by AI data experts
  • Diversity: collected from a variety of real scenes
  • Cost-effective: more cost-efficient than tailored data
  • Efficiency: ready to go, delivered in seconds