OCR Datasets

Instantly enhance AI model performance with high-quality, off-the-shelf datasets.

Data Type

All (30)
Document (4)
General Scenario (13)
Handwriting (15)
Internet image (3)
Invoice (3)
Others (5)
Test paper (1)
Table (1)

Language

All (30)
Chinese (7)
English (4)
Hindi (4)
Japanese (8)
Korean (7)
Others (20)
Vietnamese (4)

Chinese OCR Dataset in Natural Scenes – 222,289 Images

This dataset contains 222,289 images of Chinese text in natural scenes, collected both indoors and outdoors. The images cover multiple scenes and multiple shooting angles. Annotations are provided at line level, word level, and character level, each with a matching text transcription. The dataset can be used for OCR tasks in natural scenes.
Chinese OCR dataset, Chinese text recognition dataset, word-level OCR dataset, character-level OCR dataset

Form OCR Dataset – 9,497 Images of 10 Form Types

This dataset contains 9,497 images of 10 types of forms. Forms are annotated with rectangular bounding boxes. The dataset can be used for tasks such as form detection.
form OCR dataset, document form image dataset, OCR forms dataset, form recognition dataset, forms detection dataset

Primary School Math Exam Paper Image Dataset (17,561 Images)

This dataset contains 17,561 images of primary school mathematics papers. The images have a solid-color background. The data covers multiple question types, multiple paper types (math workbooks, test papers, competition papers, etc.), and multiple grade levels. The dataset can be used for tasks such as intelligent scoring and homework guidance for primary school students.
math exam paper dataset, primary school math dataset, math worksheet image dataset, math homework dataset, intelligent grading dataset, automatic scoring dataset, homework grading dataset, education OCR dataset

Vietnamese OCR Dataset with Annotations and Transcriptions (4,995 Images)

This dataset contains 4,995 Vietnamese OCR images with annotations and text transcriptions. The data includes 258 natural scene images, 2,553 Internet images, and 2,184 document images. Line-level content is annotated with quadrilateral bounding boxes and text transcriptions, and column-level content is annotated in the same way. The data can be used for tasks such as Vietnamese text recognition across multiple scenes.
Vietnamese OCR dataset, Vietnamese text recognition dataset, Vietnamese OCR images, Vietnamese OCR training data, Vietnamese text detection dataset

Large Korean & Hindi Scene Text OCR Dataset – 104,000+ Natural Images

This dataset contains 104,320 images of Korean and Hindi text in natural scenes. The collection scenes include packaging, posters, tickets, reminders, menus, building signs, and similar real-world environments. The dataset features diverse scenes, multiple shooting angles, and multiple lighting conditions. Line-level text and vertical text are each annotated with polygon (or quadrilateral or rectangular) bounding boxes, a transcription, and a text attribute (language type). The dataset can be used for Korean and Hindi OCR tasks in natural scenes.
Korean OCR dataset, Hindi OCR dataset, Scene text OCR dataset, Multilingual OCR images, OCR training data for AI

Multilingual OCR Dataset – 12 Languages Natural Scene Text

This dataset contains 105,941 images of natural scene text collected across multiple real-world environments, covering 12 languages: 6 Asian and 6 European. The data covers multiple natural scenes and multiple photographic angles. Each image is annotated with line-level quadrilateral bounding boxes and accurate text transcriptions. The data can be used for multilingual OCR, scene text recognition, and cross-language OCR model training.
multilingual OCR dataset, OCR dataset, scene text OCR dataset, natural scene OCR dataset, multi-language OCR dataset, text recognition dataset, OCR annotation dataset, line-level OCR dataset

Handwriting OCR Dataset – Japanese and Korean (22,163 Images)

This dataset contains 22,163 handwritten text images collected from 100 individuals: 50 Japanese, 49 Korean, and 1 Afghan. Each writer was given a different corpus, and the data covers multiple cellphone models. The dataset can be used for training and evaluating handwriting OCR models, handwritten text recognition systems, and multilingual OCR pipelines.
handwriting OCR dataset, handwritten OCR dataset, handwriting recognition dataset, Korean handwriting OCR dataset, multilingual handwriting OCR dataset, Japanese handwriting OCR dataset

Japanese Handwriting OCR Dataset – 4,538 Handwritten Text Images

This dataset contains 4,538 Japanese handwritten text images collected from 101 individual writers, written on A4 paper. The content covers social livelihood, entertainment, travel, sports, movies, compositions, and other fields. Annotations include character-level and line-level rectangular bounding boxes, each with a text transcription. The dataset can be used for training and evaluating Japanese handwriting OCR models, handwritten text recognition systems, and document understanding pipelines.
handwriting OCR dataset, handwritten OCR dataset, handwriting recognition dataset, Japanese handwriting OCR dataset

1,000 Images – Japanese Invoice OCR Dataset

This dataset contains 1,000 Japanese invoice images: 500 with basic virtual editing and 500 with professional editing. The data covers different invoice contents, different editing types, and multiple invoice formats. Sensitive information on the invoices, such as company name, address, personal name, fax number, and phone number, has been virtually edited and is not real. The data can be used for tasks such as invoice detection, invoice recognition, and end-to-end OCR.
Japanese invoice OCR dataset, invoice OCR dataset, invoice OCR data, invoice recognition dataset, OCR training data

Why Off-the-Shelf Datasets

  • Copyright

    Clear copyright, ready to check
  • Security

    Properly authorized, secure to use
  • Professional

    Designed and produced by AI data experts
  • Diversity

    Collected from a variety of real scenes
  • Cost Effective

    More cost-efficient than tailored data
  • Efficiency

    Ready to go, delivered in seconds