500,000 Images – Multilingual OCR Dataset in 21 Languages

multilingual OCR dataset

scene text recognition data

document OCR dataset

electronic screen OCR data

OCR dataset 21 languages

AI OCR training data

text recognition dataset

This dataset covers 21 languages, with 20,000 to 25,000 images per language. The images span natural scenes, photographed documents, and electronic screens. The data is diverse in data type, shooting angle, and language. Annotations consist of quadrilateral or polygonal regions at the row (column) level, each with a content transcription at the same level. This dataset can be used for multilingual optical character recognition (OCR) and text detection tasks.

This is a paid dataset for commercial use, research purposes, and more. Licensed ready-made datasets help jump-start AI projects.

Specifications

Data size

500,000 images in total; each language has between 20,000 and 25,000 images

Language distribution

German, French, Portuguese, Italian, Spanish, Indonesian, Russian, Japanese, Korean, Vietnamese, Polish, Czech, Turkish, Filipino, Dutch, Hindi, Malay, Kazakh, Slovak, Romanian, Uzbek
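
As a quick sanity check on the figures above, the stated total can be compared with the per-language range. This is only a sketch of the arithmetic: the variable names are illustrative, and the actual per-language counts are not given in this listing.

```python
# Languages as listed in the specification above.
LANGUAGES = [
    "German", "French", "Portuguese", "Italian", "Spanish", "Indonesian",
    "Russian", "Japanese", "Korean", "Vietnamese", "Polish", "Czech",
    "Turkish", "Filipino", "Dutch", "Hindi", "Malay", "Kazakh", "Slovak",
    "Romanian", "Uzbek",
]

TOTAL_IMAGES = 500_000
MIN_PER_LANGUAGE = 20_000
MAX_PER_LANGUAGE = 25_000

# Smallest and largest totals the per-language range allows.
lowest_total = len(LANGUAGES) * MIN_PER_LANGUAGE
highest_total = len(LANGUAGES) * MAX_PER_LANGUAGE

# Mean images per language implied by the stated total.
mean_per_language = TOTAL_IMAGES / len(LANGUAGES)

print(len(LANGUAGES), lowest_total, highest_total, round(mean_per_language))
```

The stated total of 500,000 falls inside the range implied by 21 languages, with a mean of roughly 23,800 images per language.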

Collection environment

(1) Document photography scenes: books, newspapers, various types of cards, receipts, etc. (2) Natural scenes: posters, warning signs, road signs, food packaging, billboards, bus stops, signs, etc. (3) Electronic scenes: mobile phone screenshots, computer screenshots, electronic documents

Diversity of collection

multiple data types, various shooting angles, multiple languages

Collection equipment

cellphone, computer

Data format

images are in .jpg and other common formats; annotation files are in .json

Annotation content

quadrilateral or polygonal annotation at the row (column) level, transcription of content at the row (column) level
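
The listing does not publish the JSON schema, so the following is only a sketch of how row-level annotations of this kind might be read. The field names `lines`, `points`, and `text`, as well as the sample record, are hypothetical and should be replaced with the names used in the delivered files.

```python
import json


def load_lines(annotation_json):
    """Parse one annotation document into (polygon, text) pairs.

    The keys "lines", "points" and "text" are hypothetical; the real
    schema must be taken from the dataset's own documentation.
    """
    record = json.loads(annotation_json)
    lines = []
    for item in record["lines"]:
        polygon = [(float(x), float(y)) for x, y in item["points"]]
        if len(polygon) < 4:
            raise ValueError("expected a quadrilateral or a larger polygon")
        lines.append((polygon, item["text"]))
    return lines


def bounding_box(polygon):
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up sample record with a single quadrilateral text row.
sample = json.dumps({
    "lines": [
        {
            "points": [[10, 20], [200, 22], [199, 60], [9, 58]],
            "text": "Ausfahrt freihalten",
        }
    ]
})

for polygon, text in load_lines(sample):
    print(text, bounding_box(polygon))
```

Reducing each polygon to an axis-aligned box is a common preprocessing step for detectors that do not accept arbitrary polygons; models that do can use the point list directly.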

Accuracy rate

The accuracy of the row-level detection boxes is no less than 97%. A box is considered correctly labeled if it is correctly arranged by row and deviates from the text edges by no more than 5 pixels. The transcription accuracy at the row and character levels is no less than 97%.
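
The 5-pixel criterion can be illustrated with a small check. This is a sketch under an assumption: the listing does not say how edge deviation is measured, so here a labeled box counts as correct when every corner lies within 5 pixels of the matching reference corner on both axes. The function names and sample coordinates are illustrative.

```python
def within_tolerance(labeled, reference, tol=5.0):
    """Return True if every labeled corner is within tol pixels of the
    matching reference corner on both axes (an assumed reading of the
    5-pixel edge-deviation rule)."""
    if len(labeled) != len(reference):
        return False
    return all(
        abs(lx - rx) <= tol and abs(ly - ry) <= tol
        for (lx, ly), (rx, ry) in zip(labeled, reference)
    )


def box_accuracy(pairs, tol=5.0):
    """Fraction of (labeled, reference) box pairs that pass the check."""
    if not pairs:
        raise ValueError("no boxes to evaluate")
    correct = sum(1 for labeled, ref in pairs if within_tolerance(labeled, ref, tol))
    return correct / len(pairs)


# Made-up reference box and two labeled variants.
reference = [(10, 20), (200, 20), (200, 60), (10, 60)]
close = [(x + 3, y - 2) for x, y in reference]  # inside the tolerance
far = [(x + 6, y) for x, y in reference]        # 6 px off on the x axis

accuracy = box_accuracy([(close, reference), (far, reference)])
print(accuracy, accuracy >= 0.97)
```

On a real sample, the resulting fraction would be compared against the 97% threshold stated above.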
