LLM Datasets

Instantly enhance AI model performance with high-quality off-the-shelf datasets.

Type

  • Image Caption (8)
  • SFT Datasets (1)
  • Pre-training Text (4)

100,000 Instruction-Following Evaluation SFT for Chinese LLM Text Data

100,000 instruction-following evaluation SFT text samples for Chinese LLMs. Prompts are between 50 and 400 words, with no fewer than 3 constraints in each prompt. All prompts are manually written to ensure diverse coverage.
LLM Instruction-Following SFT

Large Language Model content safety considerations text data

About 500,000 text samples in total, covering content safety considerations for large language models. This dataset can be used for tasks such as training LLMs, including ChatGPT-style models.
Large Language Model content safety considerations text data LLM Large Language Model Large Model chatgpt data

203,029 Groups - Chinese Medical Question Answering Data

The dataset contains 203,029 groups of Chinese question-answering data between doctors and patients, covering different diseases.
Medical question answering disease

2 Million Pairs Image Caption Data Of General Scenes

2 million pairs of images and descriptions. The pictures cover various categories, including landscapes, animals, flowers and trees, people, cars, sports, industry, and architecture, along with an aesthetic subset. The descriptions depict the overall scene of the image, the details within the scene, and the emotions conveyed by the image. Descriptions are provided in both English and Chinese.
Text description multi-modality general scene data set English caption Chinese caption

830,276 groups - Multi-Round Interpersonal Dialogues Text Data

This database is an interactive text corpus of real users on mobile phones. It has been desensitized so that it contains no private user information (A and B are codes that replace the sender and receiver, and sensitive information such as cellphone numbers and user names is replaced with '* * *'). This database can be used for tasks such as natural language understanding.
Interactive text corpus database text corpus database

90,000 sets – Multi-domain Customer Service Dialogue Text Data

Multi-domain customer service dialogue text data, 90,000 sets in total, spanning multiple domains, including telecommunications, e-commerce, finance, lifestyle, business, education, healthcare, and entertainment. Each set consists of single-turn or multi-turn conversations. This dataset can be used for tasks such as training LLMs, including ChatGPT-style models.
Customer Service Dialogue text data telecommunications topics data commerce topics data finance topics data LLM data Large Language Model data chatgpt data

700,000 Sets Image Caption Data Of General Scenes

700,000 sets of images and descriptions. The picture categories include landscapes, animals, flowers and trees, people, cars, sports, industries, and buildings, along with an aesthetic subset. Most images have at least two descriptions of one sentence each; a small number of images have only one description. The description languages are English and Chinese.
Text description multi-modality general scene data set English caption Chinese caption

11,000 Image & Video Caption Data of Human Action

The 11,000 Image & Video Caption Data of Human Action contains 10,000 images and 10,000 videos of various human behaviors in different seasons and from different shooting angles, including indoor and outdoor scenes. The description language is English, and the captions mainly describe the gender, age, clothing, behavior, and body movements of the people shown.
AIGC human behavior data behavior recognition data human behavior recognition data human detection data

20,011 Image Caption Data of OCR in Natural Scenes

20,011 image captions for OCR in natural scenes, covering a total of 14 Asian and European languages. The collection environments include shop plaques, stop signs, posters, road signs, and other scenes, captured from a variety of shooting angles. The description language is English, and the captions mainly describe the text arrangement, text content, color, and other information.
AIGC English caption OCR caption multilingual OCR data OCR data OCR dataset

Tailor Your Data Now

Why off-the-shelf Datasets

  • Copyright

    Clear Copyright and Ready to Check
  • Security

    Properly Authorized, Secure to Use
  • Professional

    Designed and produced by AI data experts
  • Diversity

    Collected from a variety of real scenes
  • Cost Effective

    More Cost-Efficient Than Tailored Data
  • Efficiency

    Ready to Go, Delivered in Seconds