20,011 Image Caption Data of OCR in Natural Scenes

AIGC

English caption

OCR caption

Multiple shooting angles

Multinational scenes

20,011 Image Caption Data of OCR in Natural Scenes covers 14 Asian and European languages. The collection environments include shop plaques, stop signs, posters, road signs, and other scenes, photographed from a variety of shooting angles. The descriptions are written in English and mainly cover text arrangement, text content, color, and other information.

This is a paid dataset for commercial use, research, and more. Licensed, ready-made datasets help jump-start AI projects.

Specifications

Data size

20,011 pictures, 20,011 descriptions

Language distribution

Asian languages: Korean, Indonesian, Malay, Vietnamese, Thai, Chinese, Japanese
European languages: French, German, Italian, Portuguese, Russian, Spanish, English

Collection environment

including store plaques, stop signs, posters, road signs, prompts and other scenes

Collection diversity

including 14 languages, various natural scenes, and multiple shooting angles

Data format

image format is .jpg, text format is .txt

Collection equipment

mobile phone, camera

Description language

English

Text length

in principle, 30-60 words, usually 3-5 sentences

Main description content

text arrangement, text content, color, scene

Accuracy rate

the proportion of correctly labeled images is not less than 97%
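
The specifications above state that images are .jpg files, descriptions are .txt files, and descriptions nominally run 30-60 words. A minimal Python sketch of loading and checking such data follows. It assumes each image sits next to a caption file with the same base name (for example 0001.jpg and 0001.txt) and that captions are UTF-8 encoded; neither the directory layout nor the encoding is stated in the listing, and the function names are hypothetical.

```python
from pathlib import Path


def pair_images_with_captions(root):
    """Return (image_path, caption_text) pairs for each .jpg that has a same-named .txt.

    Assumption: image and caption share a base name and a directory.
    """
    pairs = []
    for image_path in sorted(Path(root).rglob("*.jpg")):
        caption_path = image_path.with_suffix(".txt")
        if caption_path.is_file():
            # Assumption: caption files are UTF-8 encoded.
            caption = caption_path.read_text(encoding="utf-8").strip()
            pairs.append((image_path, caption))
    return pairs


def caption_word_count(caption):
    """Count whitespace-separated words in an English caption."""
    return len(caption.split())


def within_nominal_length(caption, low=30, high=60):
    """Check a caption against the nominal 30-60 word range from the specifications."""
    return low <= caption_word_count(caption) <= high
```

Because the listing gives the length range only "in principle", captions outside 30-60 words are better flagged for review than discarded.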

Recommended Dataset

Landmark Image Dataset – 200K Global Building Photos with Captions

This dataset contains 200,000 sets of images and bilingual captions (Chinese and English) featuring landmark buildings from over 20 countries, including the United States, United Kingdom, France, Germany, and Russia. Each set includes 1–10 images of a specific landmark, captured from different angles, distances, and time periods. The dataset covers approximately 80,000 domestic landmarks and 120,000 international ones. Types of landmarks include commercial buildings, ancient architecture, monuments, libraries, and scenic spots. Annotations include landmark country, city, location, category, and descriptive captions. This high-quality dataset is ideal for training models in landmark recognition, image classification, multilingual image captioning, and image-to-text retrieval.

landmark image dataset, building recognition dataset, global landmark image caption dataset, bilingual image caption data, Chinese-English caption dataset, landmark classification dataset, image-text dataset, tourism landmark dataset, cultural heritage image dataset, image captioning for AI training

120K Multimodal QA Dataset – Visual & Text Reasoning

This dataset includes 120,000 multimodal question-answer pairs across six major academic disciplines, including medicine, engineering, art, science, and more. Each QA pair combines textual and visual content—such as charts, diagrams, blueprints, and artworks—crafted to test logical reasoning, cross-modal understanding, and domain-specific knowledge. All questions have been reviewed by subject-matter experts to ensure academic quality and accuracy. Ideal for training multimodal large language models (MLLMs), visual question answering (VQA) systems, and AI applications requiring deep contextual reasoning, this dataset supports fine-tuning tasks like knowledge grounding, visual-text alignment, and decision-making. All data complies with GDPR, CCPA, and PIPL regulations, ensuring ethical use and privacy protection.

multimodal dataset, VQA dataset, multimodal QA data, reasoning dataset for AI, image-text QA dataset, domain-specific AI training data, chart reasoning dataset, LLM multimodal training data

100,000 Sets of ICONS Image Caption Data

100,000 Sets of ICONS Image Caption Data. The data includes two major categories of icons, namely 3D Style Icons and Vector Illustration Icons, totaling 17 subcategories. In terms of annotation, the icon descriptions are in Chinese, with a description length of about 30 characters. The data can be used for tasks such as graphic recognition and interface interaction.

ICONS Image caption

11,000 Image & Video Caption Data of Human Action

11,000 Image & Video Caption Data of Human Action contains 10,000 images and 10,000 videos of various human behaviors, captured in different seasons and from different shooting angles, in both indoor and outdoor scenes. The description language is English, mainly describing the gender, age, clothing, behavior, and body movements of the people shown.

AIGC, Multi-modal, English caption, Different age groups, Different lighting, Different collection environments, Different seasons of clothing, Various human behaviors

10,000 Image Caption Data of Gestures

10,000 Image Caption Data of Gestures mainly covers young and middle-aged people. The collection includes indoor and outdoor scenes, various seasons, and various shooting angles. The description language is English, mainly describing hand movements, gestures, image acquisition angle, gender, age, and similar attributes.

AIGC, Multi-modality, English caption, Multiple shooting angles, Multiple seasons, Multiple scenes

10,100 Image Caption Data of Human Face

10,100 Image Caption Data of Human Face covers multiple races and four age groups: under 18, 18-45, 46-60, and over 60. The collection scenes are varied, including indoor and outdoor scenes. The image content is also varied, including people wearing masks, glasses, or headphones, as well as facial expressions, gestures, and adversarial examples. The descriptions are written in English and mainly cover race, gender, age, shooting angle, lighting, and the diverse content listed above.

AIGC, English caption, Face description, Multiple scenes, Multiple seasons, Multiple races

21,998 Image Caption Data of Vehicles

21,875 Image Caption Data of Vehicles covers various types of cars, SUVs, MPVs, trucks, and buses. The images were collected by surveillance cameras on outdoor roads over multiple time periods. The descriptions are written in English and mainly cover vehicle type, color, vehicle orientation, scene, and similar information.

Multiple models, Multiple vehicle colors, Multiple vehicle brands, Different time periods, English caption, Vehicle attribute caption, AIGC
