Please fill in your name

Mobile phone format error

Please enter the telephone

Please enter your company name

Please enter your company email

Please enter the data requirement

Successful submission! Thank you for your support.

Format error, Please fill in again

Confirm

The data requirement cannot be less than 5 words and cannot be pure numbers

m.nexdata.datatang.com

Home > All Category Datasets > LLM Datasets > 570K Chinese LLM Content Safety Dataset

570K Chinese LLM Content Safety Dataset

llm content safety dataset

ai content safety data

content safety training data

llm safety dataset

ai moderation dataset

harmful content dataset

llm alignment dataset

This dataset containing approximately 570,000 question–answer pairs. The data covers 31 established content safety categories (CAC) along with additional emerging risk categories. All samples are written by professional annotators, this dataset can be used for tasks such as large language model training, safety evaluation, and supervised fine-tuning focused on content moderation and risk handling.

This is a paid datasets for commercial use, research purpose and more. Licensed ready made datasets help jump-start AI projects.

Specifications

Data content

Large Language Model content safety considerations text data

Data size

About 570,000 sets of question and answer data; covering 31 categories of CAC + other new categories

Collecting type

41 major categories

Collecting method

written by professional annotators

Storage format

Excel

Language

Chinese

Sample

Recommended Dataset

1.51 Million Sets of Single-image and Multi-image Fusion Image Editing Data

1.51 Million Sets of Single-image and Multi-image Fusion Image Editing Data. Editing types include 500,000 sets of portrait/object consistency editing, 300,000 sets of structural edits, 210,000 sets of mixed editing, and 450,000 sets of spatial editing, and 50,000 sets of style transfer editing. The editing targets cover scenes such as people, animals, goods, plants, and landscapes. In terms of annotation, the targets that need to be edited in the image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.

Image Editing Multi-image Fusion

100K English Instruction Tuning Dataset – General Domain SFT for LLM Fine-Tuning

100,000 Fine-Tuning Text Dataset for English LLM General Domain SFT is a high-quality supervised fine-tuning corpus designed to optimize instruction-following capabilities in large language models. Each data point is double-verified by experienced linguistic professionals and AI engineers to ensure relevance, clarity, and effectiveness in improving model alignment and response precision. The dataset supports instruction tuning tasks across a wide range of general knowledge domains and is compatible with leading open-source LLMs such as LLaMA, Falcon, GPT-NeoX, and Mistral. Ideal for use in alignment, safety tuning, and instruction-based generation enhancement, this dataset offers a robust foundation for model adaptation and performance improvement. All data complies with global data usage and privacy standards.

LLM fine-tuning dataset supervised fine-tuning SFT dataset English instruction tuning data general domain LLM data AI model fine-tuning instruction-following training data GPT tuning dataset

50,000 Image Editing Datasets – Object Removal, Addition & Modification Dataset for AI Training

50,000 Sets - Image Editing Data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover scenes such as people, animals, goods, plants, and landscapes. In terms of annotation, based on the editing instructions, the targets that need to be edited in the image are edited. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.

image editing dataset image synthesis data object removal dataset object addition data AI image generation dataset virtual scene dataset annotated image editing data inpainting dataset AI training data for image manipulation generative image dataset

100K Chinese LLM Instruction-Following Dataset

This dataset contains 50-400 words, with each prompt containing at least three constraints to train and improve the instruction-following performance of large models. Categories cover generation (news releases, interview outlines, copywriting, manuscript proofreading, Chinese-English essays, grammar learning, research reports, study plans, poetry writing, food descriptions, advertising copy, sales scripts, official document writing assistance, official document review, policy document Q&A, etc.), rewriting (sentence rewriting, text correction, sentence merging, copywriting simplification), summarizing (content summarization), and extraction (event element extraction, opinion extraction, keyword extraction, stance extraction, entity extraction). All prompts are manually compiled to ensure diverse coverage. The dataset is suitable for systematic benchmarking and model assessment.

LLM evaluation dataset Chinese LLM instruction following dataset Instruction-following prompt dataset Prompt benchmark dataset LLM

Tell Us Your Special Needs

Current Project Maturity

Early exploration (no concrete specs yet)

Defined goals, need professional guidance

Active development or optimization phase

Data & labeling experts with clear specifications

Full Name *

Contact Phone No.*

Company name *

Company Email *

Data Requirements *

By submitting, I agree to the Privacy Protection

Submit

Subscribe to our newsletter

Be the first to receive Nexdata latest product releases, data solutions and enterprise news.

Off-the-Shelf Datasets: All Category Datasets; LLM Datasets; Computer Vision Datasets; Speech Recognition Datasets; Speech Synthesis Datasets; OCR Datasets; Pronunciation Dictionary; NLU Datasets

Data Service: 3D Point Cloud Data; Street View Data; OCR Data; Behavior Recognition Data; Identity Recognition Data; Speech Recognition Data; Speech Synthesis Data; Multimodal Data

Industries: Embodied AI; Generative AI; Autonomous Vehicles; AR/VR; Conversational AI; Smart Home; Retail; Intelligent Healthcare

Company: About Us; News; Partners; Quality & Security; Event
Links: OPENMPD; DataPlus; Datarade

Platform: Platform
Competition: Competition
Resources: Sponsored Datasets

Sharpen Your AI with Better Data

+1(626)594-5598

[email protected]

Sitemap Terms and Conditions

We use cookies to enhance your browsing experience, serve personalized ads or content, and analyze our traffic. By clicking "Accept All", you consent to our use of cookies.

0a40aaaf-671e-4c2f-962c-a9484a7d1744

da55ed59-a354-46ff-899b-84b6884a9572

570K Chinese LLM Content Safety Dataset

llm content safety dataset ai content safety data content safety training data llm safety dataset ai moderation dataset harmful content dataset llm alignment dataset

Current Project Maturity

llm content safety dataset

ai content safety data

content safety training data

llm safety dataset

ai moderation dataset

harmful content dataset

llm alignment dataset