LLM Datasets

Instantly enhance AI model performance with high quality off-the-shelf datasets.

Type: All (36), Image Caption (16), SFT Datasets (5), Pre-training Text (17)

6.03 Million - Majors Questions Text Parsing And Processing Data

About 6.03 million major-specific questions, combining questions with and without explanations. Each question includes a question type, question, answer, and, where available, an explanation; some questions may have incorrect question-type labels. Majors include Party Building, Law, Engineering, Civil Service, Computer Science, Economics, Graduate Studies, Medicine, Language, Self-Study, Comprehensive, and Policy Essay Writing. Question types include Multiple Choice, Single Choice, True/False, Fill in the Blanks, Short Answer, and Essay. This dataset can be used for tasks such as training LLMs and ChatGPT-style models.
Majors questions, Text, LLM

Large Language Model Content Safety Considerations Text Data

Text data on content safety considerations for large language models, about 570,000 entries in total. This dataset can be used for tasks such as training LLMs and ChatGPT-style models.
Content safety, Text, LLM

6.9 million - Chinese Multi-disciplinary Questions Text Parsing And Processing Data

6.9 million Chinese multi-disciplinary questions covering multiple disciplines at the primary school, middle school, high school, and university levels. Each question contains a title, answer, parse, type, subject, and grade. The dataset can be used for subject knowledge enhancement tasks in large models.
Chinese multi-disciplinary Questions, LLM, Text

Japanese Q&A Dataset from OKWAVE – 8.4M Questions

This dataset is collected from the Japanese OKWAVE Q&A platform and includes large-scale parsed and processed text data suitable for LLM training and Japanese natural language understanding. It contains structured fields such as questions, answers, categories, timestamps, user metadata, and supplementary explanations. As of April 2025, the dataset includes 8.4 million questions with 2.3 billion words, 27 million answers totaling 7.6 billion words, 15.5 million thank-you messages (1.7 billion words), and 2.1 million supplementary replies (360 million words). Continuously updated and rich in user-generated content, this dataset is ideal for building Japanese conversational AI, ChatGPT fine-tuning, question answering systems, text summarization, and semantic parsing models. All data complies with relevant data usage and privacy regulations.
Japanese Q&A dataset, OKWAVE forum data, Japanese language corpus, Japanese dialogue dataset, ChatGPT Japanese fine-tuning, user-generated content, question answer dataset

200,000 Multilingual Text Dataset in French, German, Spanish & Italian for NLP Training

This dataset contains 200,000 pieces of high-quality multilingual text content, evenly distributed across four languages: French, German, Spanish, and Italian (50,000 per language). The text samples span over 200 categories such as architecture, animals, automobiles, food & beverage, movies, zodiac signs, and cybersecurity. Designed to support a variety of natural language processing (NLP) tasks, this dataset is ideal for multilingual language model fine-tuning, cross-lingual classification, machine translation, and generative AI applications. All content is clean, well-formatted, and suitable for commercial and academic AI research.
multilingual text dataset, French text dataset, German text dataset, Spanish text dataset, Italian text data, NLP multilingual training, language model fine-tuning, categorized text dataset, LLM training data, multilingual corpus

100K English Instruction Tuning Dataset – General Domain SFT for LLM Fine-Tuning

This English general-domain SFT dataset of 100,000 data points is a high-quality supervised fine-tuning corpus designed to optimize instruction-following capabilities in large language models. Each data point is double-verified by experienced linguistic professionals and AI engineers to ensure relevance, clarity, and effectiveness in improving model alignment and response precision. The dataset supports instruction tuning tasks across a wide range of general knowledge domains and is compatible with leading open-source LLMs such as LLaMA, Falcon, GPT-NeoX, and Mistral. Ideal for alignment, safety tuning, and instruction-based generation enhancement, this dataset offers a robust foundation for model adaptation and performance improvement. All data complies with global data usage and privacy standards.
LLM fine-tuning dataset, supervised fine-tuning, SFT dataset, English instruction tuning data, general domain LLM data, AI model fine-tuning, instruction-following training data, GPT tuning dataset

300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI Training

This high-quality image-caption dataset is a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.
image-caption dataset, image-text pairs, vision-language data, generative AI training dataset, multimodal AI dataset, image description data, LLM vision data, AI image-text alignment, high-quality image data

50,000 Image Editing Datasets – Object Removal, Addition & Modification Dataset for AI Training

This image editing dataset contains 50,000 sets of high-quality image pairs and annotations for object removal, addition, modification, and replacement. Editing targets span people, animals, products, plants, and landscapes across diverse real-world scenes. Each set includes clearly labeled annotations marking the regions and changes required by the editing instructions. This dataset is ideal for tasks such as image synthesis, AI-based photo editing, virtual scene generation, data augmentation, inpainting, and training image manipulation models. All data has been quality tested and complies with global privacy standards, including GDPR, CCPA, and PIPL.
image editing dataset, image synthesis data, object removal dataset, object addition data, AI image generation dataset, virtual scene dataset, annotated image editing data, inpainting dataset, AI training data for image manipulation, generative image dataset

English Animal Healthcare Dataset – 250K Pet Medical Records

This dataset contains 250,000 English veterinary medical records covering multiple animal species. It includes structured information on hospital visits, treatment history, diagnostic results, allergy testing, vaccination records, and pet prescriptions. The data has been curated from real-world clinical scenarios, enhancing model performance in natural language understanding, AI medical reasoning, and veterinary applications. All data have been quality tested by AI enterprises and are fully compliant with GDPR, CCPA, and PIPL regulations to ensure user privacy and ethical use.
veterinary dataset, pet medical records, animal healthcare dataset, veterinary clinical data, pet prescription data, allergy test dataset, vaccination records dataset, animal EMR data, pet hospital data, veterinary NLP dataset, animal diagnosis dataset, veterinary AI training data

Tailor Your Data Now

Why Off-the-Shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and produced by AI data experts
  • Diversity

    Collected from a variety of real scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready to Go, Delivered in Seconds