200,000 Multilingual Text Dataset in French, German, Spanish & Italian for NLP Training

multilingual text dataset

French text dataset

German text dataset

Spanish text dataset

Italian text data

NLP multilingual training

language model fine-tuning

categorized text dataset

LLM training data

multilingual corpus

This dataset contains 200,000 pieces of high-quality multilingual text content, evenly distributed across four languages: French, German, Spanish, and Italian (50,000 per language). The text samples span over 200 categories such as architecture, animals, automobiles, food & beverage, movies, zodiac signs, and cybersecurity. Designed to support a variety of natural language processing (NLP) tasks, this dataset is ideal for multilingual language model fine-tuning, cross-lingual classification, machine translation, and generative AI applications. All content is clean, well-formatted, and suitable for commercial and academic AI research.

This is a paid dataset licensed for commercial use, research, and other purposes. Licensed, ready-made datasets help jump-start AI projects.
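The even split described above (four languages, 50,000 samples each, 200,000 in total) is easy to verify after delivery. The following is a minimal sketch, not the vendor's tooling: it assumes each record is a dictionary with a hypothetical `language` field whose values are the language names as written in the description. The actual file format, field names, and label values are not stated in this listing and may differ.

```python
from collections import Counter

# Figures stated in the dataset description.
EXPECTED_TOTAL = 200_000
EXPECTED_PER_LANGUAGE = 50_000
LANGUAGES = ("French", "German", "Spanish", "Italian")

# The stated per-language count and total agree: 4 * 50,000 = 200,000.
assert EXPECTED_PER_LANGUAGE * len(LANGUAGES) == EXPECTED_TOTAL


def language_counts(records, language_field="language"):
    """Count records per language.

    `language_field` is a hypothetical field name; adjust it to the
    schema of the delivered files.
    """
    return Counter(record[language_field] for record in records)


def is_balanced(counts, expected_per_language=EXPECTED_PER_LANGUAGE):
    """Return True if exactly the four listed languages appear,
    each with the expected number of records."""
    return set(counts) == set(LANGUAGES) and all(
        counts[lang] == expected_per_language for lang in LANGUAGES
    )
```

Calling `is_balanced(language_counts(records))` on the loaded records would then confirm whether the delivered data matches the advertised distribution.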

Recommended Dataset

Long Context Reasoning Dataset – Multi-Language (EN/CH/KR) Benchmark for LLM Evaluation

This dataset is designed to tackle the core weaknesses of today's large language models when it comes to processing long documents and performing complex reasoning. It consists of 7,500 high-quality training examples across three languages—Chinese, English, and Korean. Each instance is built around a long-text passage and includes questions that require synthesizing information across paragraphs and documents, while following multi-step logical chains. The goal is to offer a thorough and rigorous evaluation framework that tests a model's ability to perceive long-range context, retrieve relevant information, construct sound reasoning paths, and trace evidence back to its source.

long context dataset long context reasoning dataset LLM long context dataset long document QA dataset multi hop reasoning dataset reasoning dataset for LLM multi step reasoning dataset

1.5 Million English STEM Test Questions Dataset – Science and Engineering Subjects

This dataset contains 1.5 million English science and engineering test questions covering mathematics, physics, chemistry, biology, and other STEM subjects at the university level. Each question contains a title, answer, parse (explanation), type, subject, and grade. The dataset can be used for subject-knowledge enhancement tasks for large models.

Question-answer dataset Question processing dataset Labeled STEM exam dataset Large-scale test question dataset English STEM test question dataset

Japanese Q&A Dataset from OKWAVE – 8.4M Questions

This dataset is collected from the Japanese OKWAVE Q&A platform and includes large-scale parsed and processed text data suitable for LLM training and Japanese natural language understanding. It contains structured fields such as questions, answers, categories, timestamps, user metadata, and supplementary explanations. As of April 2025, the dataset includes 8.4 million questions with 2.3 billion words, 27 million answers totaling 7.6 billion words, 15.5 million thank-you messages (1.7 billion words), and 2.1 million supplementary replies (360 million words). Continuously updated and rich in user-generated content, this dataset is ideal for building Japanese conversational AI, ChatGPT fine-tuning, question answering systems, text summarization, and semantic parsing models. All data complies with relevant data usage and privacy regulations.

Japanese Q&A dataset OKWAVE forum data Japanese language corpus Japanese dialogue dataset ChatGPT Japanese fine-tuning user-generated content question answer dataset

6.9M Chinese Educational QA Dataset for LLM Training (K12 to University)

This dataset contains 6.9 million Chinese educational question-answer pairs covering multiple disciplines from primary school to university levels, including mathematics, science, and other academic subjects. Each question includes a title, answer, explanation, question type, subject, and grade level. The dataset is suitable for LLM instruction tuning, educational AI systems, tutoring platforms, math reasoning models, and general knowledge enhancement tasks.

educational dataset math dataset stem dataset k12 dataset instruction dataset for llm

1M Chinese Coding Questions Dataset – Python/Java/C++

This dataset contains 1 million Chinese programming questions with corresponding answers, detailed parses (explanations), and programming language labels. It includes a wide range of questions in C, C++, Python, Java, and JavaScript, making it ideal for training large language models (LLMs) on multilingual code understanding and generation. The questions cover fundamental to advanced topics, supporting AI applications such as code completion, bug fixing, and programming reasoning. This structured dataset enhances model performance in natural language programming tasks and helps reinforce code logic skills in AI systems. All data complies with international privacy regulations including GDPR, CCPA, and PIPL.

Chinese coding questions dataset programming QA data parsed coding problems Python Java C++ dataset code generation LLM dataset Chinese code questions

32M Science QA Dataset – Answers & Parsing for LLMs

This dataset contains 32 million structured science questions covering mathematics, physics, chemistry, and biology across primary, middle, high school, and university levels. Each question entry includes a title, answer, solution parsing, question type, subject category, and corresponding grade level. The dataset is designed to support AI training tasks such as large language model development, subject-specific knowledge enhancement, machine reading comprehension, and question-answering systems. It provides a rich resource for educational NLP applications and has been validated for quality and completeness. All data complies with global data protection standards including GDPR, CCPA, and PIPL.

science question dataset STEM QA dataset math physics chemistry biology questions education NLP dataset AI training data structured question answer dataset academic QA dataset question parsing dataset K-12 science dataset university level questions dataset

114K Chinese Olympiad Questions Dataset – STEM & QA

This dataset includes 114,000 structured Chinese academic contest questions from primary, middle, and high school levels. Subjects covered include mathematics, physics, chemistry, and biology. Each question is annotated with the question title, correct answer, parse (explanation), subject, grade level, and question type, making it highly suitable for fine-tuning educational large language models (LLMs) and intelligent tutoring systems. The data mirrors real-world Olympiad and competitive test formats in China, providing rich material for enhancing subject-specific knowledge and reasoning capabilities in AI systems. All data complies with global privacy regulations including GDPR, CCPA, and PIPL.

Chinese exam dataset Chinese contest questions Olympiad question dataset parsed Chinese QA AI dataset for Chinese education NLP training Chinese STEM math Olympiad Chinese education QA data

2.4M Korean Exam Question Dataset for AI Training

This dataset contains 2.4 million structured Korean exam questions covering primary, middle, and high school subjects including Korean, Mathematics, English, Social Studies, Science, Physics, Chemistry, Biology, History, and Geography. Each record includes question type (multiple-choice, fill-in-the-blank, true/false, short answer), the question itself, standard answers, and detailed explanations. The data is professionally annotated and categorized by subject and academic level, making it ideal for training AI models in educational applications such as question answering systems, tutoring bots, academic reasoning, and subject-level knowledge enhancement. It is widely applicable for natural language processing tasks involving structured QA, exam-style NLP training, and educational content generation. All data is collected and processed in compliance with GDPR, CCPA, and PIPL standards, ensuring privacy and legal integrity throughout the lifecycle.

korean exam dataset education dataset test question dataset multiple choice QA dataset K-12 school question data AI training dataset for education NLP exam data structured Korean question dataset school subject QA dataset

