OCR Training Datasets – Handwriting & Document | Nexdata

Data Type: All (29); Document (4); General Scenario (13); Handwriting (15); Internet image (1); Invoice (3); Others (4); Test paper (1); Table (1)

Language: All (29); Chinese (6); English (4); Hindi (4); Japanese (8); Korean (7); Others (20); Vietnamese (4)

Vietnamese OCR Dataset with Annotations and Transcriptions (4,995 Images)

This dataset contains 4,995 Vietnamese OCR images with annotations and text transcriptions. The data includes 258 natural scene images, 2,553 Internet images, and 2,184 document images. For line-level content annotation, quadrilateral bounding box annotations and text transcriptions are provided. For column-level content annotation, column-level quadrilateral bounding box annotation and text transcription are provided. The data can be used for tasks such as Vietnamese recognition in multiple scenes.

Vietnamese OCR dataset, Vietnamese text recognition dataset, Vietnamese OCR images, Vietnamese OCR training data, Vietnamese text detection dataset
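
Several listings on this page describe line-level quadrilateral bounding boxes paired with text transcriptions. The page does not specify the delivery file format, so the following is only a minimal sketch of how one such annotation might be held in memory and converted to an axis-aligned rectangle for a detector that expects rectangles. The `LineAnnotation` class, its field names, the corner ordering, and the sample coordinates are all hypothetical and are not taken from the dataset documentation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class LineAnnotation:
    """One annotated text line (hypothetical in-memory layout)."""
    quad: List[Point]  # assumed order: top-left, top-right, bottom-right, bottom-left
    text: str          # transcription of the text inside the quadrilateral


def quad_to_rect(quad: List[Point]) -> Tuple[float, float, float, float]:
    """Return the smallest axis-aligned box (x_min, y_min, x_max, y_max) enclosing the quad."""
    if len(quad) != 4:
        raise ValueError("a quadrilateral needs exactly four corner points")
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def quad_area(quad: List[Point]) -> float:
    """Area of a simple (non-self-intersecting) quadrilateral via the shoelace formula."""
    if len(quad) != 4:
        raise ValueError("a quadrilateral needs exactly four corner points")
    total = 0.0
    for i in range(4):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % 4]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Made-up sample annotation, for illustration only.
sample = LineAnnotation(
    quad=[(10, 20), (110, 25), (108, 60), (8, 55)],
    text="Xin chào",
)
rect = quad_to_rect(sample.quad)
area = quad_area(sample.quad)
```

Keeping the four corners instead of storing only the rectangle preserves the tilt of the text line, which matters for scene images shot at an angle; the rectangle can always be derived later, but not the other way round.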

Hindi OCR Dataset – 3,506 Images with Transcription

This dataset contains 3,506 Hindi OCR images, including 2,056 natural scene images, 1,103 Internet images, and 347 document images. For line-level content, line-level quadrilateral bounding box annotation and text transcription were adopted; for column-level content, column-level quadrilateral bounding box annotation and text transcription were adopted. The dataset can be used for Hindi OCR, Hindi character recognition, and text detection tasks across multiple real-world scenes.

Hindi OCR dataset, Hindi text recognition dataset, Hindi scene text dataset, Hindi image text dataset, Hindi document OCR dataset

57,645 Images - Vertical OCR Data in Text Scenes

This dataset contains 57,645 images of vertical text in text scenes. The collection scenes include street scenes, plaques, billboards, posters, decorations, art lettering, magazine covers, etc. The languages are Chinese and a small amount of English. Vertical text is annotated with rectangular bounding boxes (or polygonal or parallelogram bounding boxes) and transcribed; non-vertical text is likewise annotated with rectangular bounding boxes (or polygonal or parallelogram bounding boxes) and transcribed. This dataset can be used for tasks such as OCR across multiple vertical text scenes.

OCR, Multiple scenes, Multiple fonts

14,980 PPT Images – Multilingual OCR Dataset (8 Languages)

This dataset contains 14,980 PowerPoint slide images across 8 languages (French, Korean, Japanese, Spanish, German, Italian, Portuguese, and Russian). It covers multiple scenes, different photographic angles and distances, and different lighting conditions. For annotation, each text line was annotated with a quadrilateral bounding box and transcribed. The dataset can be used for tasks such as developing multilingual OCR systems.

multilingual PPT OCR dataset, PowerPoint OCR dataset for AI, OCR training dataset, AI dataset for PowerPoint text extraction

Handwriting OCR Dataset – Japanese and Korean (22,163 Images)

This dataset contains handwritten text images collected from 100 individuals, including 50 Japanese, 49 Koreans, and 1 Afghan. The corpus differs from writer to writer. Data diversity includes multiple cellphone models and different corpora. This dataset can be used to develop handwriting OCR models, handwritten text recognition systems, and multilingual OCR pipelines.

handwriting OCR dataset, handwritten OCR dataset, handwriting recognition dataset, Korean handwriting OCR dataset, multilingual handwriting OCR dataset, Japanese handwriting OCR dataset

5,147 Images Japanese Handwriting OCR dataset

The text carriers include A4 paper, lined paper, quadrille paper, etc. The images were captured with cellphones at an eye-level angle. The content includes Japanese compositions, poetry, prose, news, stories, etc. For annotation, line-level quadrilateral bounding boxes and transcriptions of the text are provided. The dataset can be used to develop Japanese OCR models and handwritten text recognition systems.

Japanese OCR dataset, Japanese handwriting OCR dataset, Japanese HTR dataset, OCR training dataset

Japanese Handwriting OCR Dataset – 4,538 Handwritten Text Images

This dataset contains 4,538 Japanese handwritten text images collected from 101 individual writers, written on A4 paper. The content covers social livelihood, entertainment, tourism, sports, movies, compositions, and other fields. For annotation, both character-level and line-level rectangular bounding box annotation with text transcription were adopted. The dataset can be used for training and evaluating Japanese handwriting OCR models, handwritten text recognition systems, and document understanding pipelines.

handwriting OCR dataset, handwritten OCR dataset, handwriting recognition dataset, Japanese handwriting OCR dataset

Multilingual OCR Dataset – 12 Languages Natural Scene Text

This dataset contains 105,941 images of natural scene text collected across multiple real-world environments, covering 12 languages: 6 Asian languages and 6 European languages. The data covers multiple natural scenes and multiple photographic angles. Each image is annotated with line-level quadrilateral bounding boxes and text transcriptions. The data can be used for multilingual OCR, scene text recognition, and cross-language OCR model training.

multilingual OCR dataset, OCR dataset, scene text OCR dataset, natural scene OCR dataset, multi-language OCR dataset, text recognition dataset, OCR annotation dataset, line-level OCR dataset

426,687 Images - Multilingual OCR Dataset – Document & Scene Text

This dataset contains 426,687 high-resolution images of multilingual Optical Character Recognition (OCR) data across both natural scenes and various document types. It spans 20 languages, including Traditional Chinese, Simplified Chinese, Japanese, Korean, Thai, Vietnamese, Indonesian, Malay, Polish, and more. The data covers a wide range of real-world conditions (natural scenes, printed documents, handwritten notes, signs, and posters) captured in multiple countries and environments, with varied backgrounds, lighting conditions, and camera angles. All images are annotated for OCR tasks, making the dataset suitable for training deep learning models for text detection, recognition, and layout analysis in multi-language scenarios. The dataset complies with global data protection standards (GDPR, CCPA, PIPL), and is validated by leading AI enterprises for commercial and research applications.

multilingual OCR dataset, scene text dataset, document OCR images, Chinese OCR, Japanese OCR dataset, Thai text recognition, OCR dataset with annotation, multi-language OCR, document image dataset, OCR training data

Why off-the-shelf Datasets

Copyright
Clear Copyright and Ready to Check
Security
Properly Authorized, Secure to Use
Professional
Designed and produced by AI data experts
Diversity
Collected from a variety of real scenes
Cost Effective
More Cost-Efficient Than Tailored Data
Efficiency
Ready to Go, Delivered in Seconds

Copyright © 2023 NEXDATA TECHNOLOGY INC
