LLM Datasets

Instantly enhance AI model performance with high quality off-the-shelf datasets.

Type

All: 36
Image Caption: 16
SFT Datasets: 5
Pre-training Text: 17

Landmark Image Dataset – 200K Global Building Photos with Captions

This dataset contains 200,000 sets of images and bilingual captions (Chinese and English) featuring landmark buildings from over 20 countries, including the United States, United Kingdom, France, Germany, and Russia. Each set includes 1–10 images of a specific landmark, captured from different angles, distances, and time periods. The dataset covers approximately 80,000 domestic landmarks and 120,000 international ones. Types of landmarks include commercial buildings, ancient architecture, monuments, libraries, and scenic spots. Annotations include landmark country, city, location, category, and descriptive captions. This high-quality dataset is ideal for training models in landmark recognition, image classification, multilingual image captioning, and image-to-text retrieval.
landmark image dataset building recognition dataset global landmark image caption dataset bilingual image caption data Chinese-English caption dataset landmark classification dataset image-text dataset tourism landmark dataset cultural heritage image dataset image captioning for AI training

32M Science QA Dataset – Answers & Parsing for LLMs

This dataset contains 32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, answer, solution parsing, question type, subject category, and corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.
science question dataset STEM QA dataset math physics chemistry biology questions education NLP dataset AI training data structured question answer dataset academic QA dataset question parsing dataset K-12 science dataset university level questions dataset
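
The six fields listed for each question entry can be sketched as a small record type. This is a minimal sketch only: the field names, the JSON-lines layout, and the sample record below are assumptions made for illustration, not the vendor's published schema or delivery format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScienceQuestion:
    """One question entry. Field names are hypothetical, not the vendor's schema."""
    title: str          # the question text
    answer: str         # the reference answer
    parsing: str        # the worked solution ("solution parsing")
    question_type: str  # e.g. single choice, short answer
    subject: str        # mathematics, physics, chemistry, or biology
    grade_level: str    # primary, middle, high school, or university


# Field names in declaration order, taken from the dataclass itself.
REQUIRED_FIELDS = tuple(ScienceQuestion.__dataclass_fields__)


def parse_entry(line: str) -> ScienceQuestion:
    """Parse one JSON line, raising ValueError if a listed field is missing."""
    raw = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ScienceQuestion(**{name: str(raw[name]) for name in REQUIRED_FIELDS})


# Made-up sample record for illustration only; not taken from the dataset.
sample = json.dumps({
    "title": "What is the SI unit of force?",
    "answer": "The newton (N).",
    "parsing": "Force equals mass times acceleration, so the unit is kg*m/s^2.",
    "question_type": "short answer",
    "subject": "physics",
    "grade_level": "middle school",
})

entry = parse_entry(sample)
print(entry.subject, entry.grade_level)
```

Checking for missing fields at load time is one way to catch incomplete entries early; the real files may use different keys or a different container format.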

6.03 Million - Majors Questions Text Parsing And Processing Data

About 6.03 million majors questions, with and without explanations combined. Each question includes question type, question, answer, and explanation; some questions may have errors in their question types. Majors include Party Building, Law, Engineering, Civil Service, Computer Science, Economics, Graduate Studies, Medicine, Language, Self-Study, Comprehensive, and Policy Essay Writing. Question types include Multiple Choice, Single Choice, True/False, Fill in the Blanks, Short Answer, and Essay. This dataset can be used for tasks such as training LLMs and ChatGPT-style models.
Majors questions Text LLM

Large Language Model content safety considerations text data

Content safety text data for large language models, about 570,000 in total. This dataset can be used for tasks such as training LLMs and ChatGPT-style models.
Content safety Text LLM

6.9 million - Chinese Multi-disciplinary Questions Text Parsing And Processing Data

About 6.9 million Chinese multi-disciplinary questions with text parsing and processing, covering multiple disciplines at primary school, middle school, high school, and university levels. Each question contains a title, answer, parse, type, subject, and grade. The dataset can be used for subject knowledge enhancement tasks for large models.
Chinese multi-disciplinary Questions LLM Text

Japanese Q&A Dataset from OKWAVE – 8.4M Questions

This dataset is collected from the Japanese OKWAVE Q&A platform and includes large-scale parsed and processed text data suitable for LLM training and Japanese natural language understanding. It contains structured fields such as questions, answers, categories, timestamps, user metadata, and supplementary explanations. As of April 2025, the dataset includes 8.4 million questions with 2.3 billion words, 27 million answers totaling 7.6 billion words, 15.5 million thank-you messages (1.7 billion words), and 2.1 million supplementary replies (360 million words). Continuously updated and rich in user-generated content, this dataset is ideal for building Japanese conversational AI, ChatGPT fine-tuning, question answering systems, text summarization, and semantic parsing models. All data complies with relevant data usage and privacy regulations.
Japanese Q&A dataset OKWAVE forum data Japanese language corpus Japanese dialogue dataset ChatGPT Japanese fine-tuning user-generated content question answer dataset

200,000 Multilingual Text Dataset in French, German, Spanish & Italian for NLP Training

This dataset contains 200,000 pieces of high-quality multilingual text content, evenly distributed across four languages: French, German, Spanish, and Italian (50,000 per language). The text samples span over 200 categories such as architecture, animals, automobiles, food & beverage, movies, zodiac signs, and cybersecurity. Designed to support a variety of natural language processing (NLP) tasks, this dataset is ideal for multilingual language model fine-tuning, cross-lingual classification, machine translation, and generative AI applications. All content is clean, well-formatted, and suitable for commercial and academic AI research.
multilingual text dataset French text dataset German text dataset Spanish text dataset Italian text data NLP multilingual training language model fine-tuning categorized text dataset LLM training data multilingual corpus

100K English Instruction Tuning Dataset – General Domain SFT for LLM Fine-Tuning

This dataset of 100,000 English general-domain SFT data points is a high-quality supervised fine-tuning corpus designed to optimize instruction-following capabilities in large language models. Each data point is double-verified by experienced linguistic professionals and AI engineers to ensure relevance, clarity, and effectiveness in improving model alignment and response precision. The dataset supports instruction tuning tasks across a wide range of general knowledge domains and is compatible with leading open-source LLMs such as LLaMA, Falcon, GPT-NeoX, and Mistral. Ideal for use in alignment, safety tuning, and instruction-based generation enhancement, this dataset offers a robust foundation for model adaptation and performance improvement. All data complies with global data usage and privacy standards.
LLM fine-tuning dataset supervised fine-tuning SFT dataset English instruction tuning data general domain LLM data AI model fine-tuning instruction-following training data GPT tuning dataset

300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI Training

This image-caption dataset includes a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.
image-caption dataset image-text pairs vision-language data generative AI training dataset multimodal AI dataset image description data LLM vision data AI image-text alignment high-quality image data

Why off-the-shelf Datasets

  • Copyright: Clear copyright and ready to check
  • Security: Properly authorized and secure to use
  • Professional: Designed and produced by AI data experts
  • Diversity: Collected from a variety of real scenes
  • Cost Effective: More cost-efficient than tailored data
  • Efficiency: Ready to go, delivered in seconds