From: Nexdata | Date: 2024-08-15
In the field of machine learning and deep learning, datasets play an irreplaceable role. Whether it is image data for convolutional neural networks or massive text corpora for natural language processing, the integrity and diversity of the data directly determine what a model can learn. As the technology has advanced, collecting datasets from specific scenarios has become a core strategy for improving model performance.
Optical character recognition (OCR) is the task of having an electronic device such as a scanner or digital camera examine the characters in an image and then translate their shapes into machine-readable text. Applications of OCR include automated data entry for business documents, translation apps, online databases, and security cameras that automatically recognize license plates.
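To make the task concrete, here is a minimal sketch of running OCR with the open-source Tesseract engine through its pytesseract wrapper (this assumes Tesseract is installed on the system; the input filename is a placeholder):

```python
# Minimal OCR sketch: recognize the text in one image with Tesseract.
# "sample.png" is a placeholder, not a file from this article.
from PIL import Image
import pytesseract

image = Image.open("sample.png")           # load the input image
text = pytesseract.image_to_string(image)  # run character recognition
print(text)
```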
In this article, I have sorted out some commonly used datasets in the field of OCR research.
1. COCO-Text
The COCO-Text dataset contains 63,686 images with 145,859 cropped text instances. It is the first large-scale dataset for text in natural images and also the first dataset to annotate scene text with attributes such as legibility and type of text. However, no lexicon is associated with COCO-Text.
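As a rough illustration of how these annotations can be consumed, the sketch below reads the released JSON directly. It assumes the v1.x layout, in which a top-level "anns" dictionary carries "legibility" and "utf8_string" fields; field names may differ between releases, so check the file you actually download:

```python
# Rough sketch: count the legible, transcribed text instances in a
# COCO-Text annotation file (assumed v1.x JSON layout).
import json

with open("COCO_Text.json") as f:
    coco_text = json.load(f)

legible = [
    ann for ann in coco_text["anns"].values()
    if ann.get("legibility") == "legible" and ann.get("utf8_string")
]
print(f"{len(legible)} legible annotated text instances")
```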
2. SynthText
SynthText (ST) could be called the ImageNet of OCR. The dataset is generated synthetically: roughly 8 million text instances are artificially rendered onto 800,000 images. The synthesis is not a blunt overlay; the text is processed so that it blends naturally into each picture.
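The sketch below is a deliberately crude version of that idea, pasting a word onto a photo with PIL. The real SynthText pipeline goes much further, using depth and segmentation cues to place and blend text along scene surfaces; all file and font names here are placeholders:

```python
# Crude synthetic-text illustration: render a word onto a background
# photo. The word and its position double as a free, exact annotation.
from PIL import Image, ImageDraw, ImageFont

background = Image.open("street_scene.jpg").convert("RGB")
draw = ImageDraw.Draw(background)
font = ImageFont.truetype("DejaVuSans.ttf", size=48)

word, position = "SALE", (120, 200)
draw.text(position, word, font=font, fill=(240, 240, 60))
background.save("synthetic_sample.jpg")
```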
3. IIIT5K
The IIIT5K dataset contains 5,000 cropped text instance images: 2,000 for training and 3,000 for testing. It contains words from street scenes and from born-digital images. Every image is associated with a 50-word lexicon and a 1,000-word lexicon. Specifically, each lexicon consists of the ground-truth word plus randomly picked distractor words.
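The sketch below shows the usual way such a lexicon is used at evaluation time: the model's raw output is snapped to the most similar lexicon word. The function name and the toy three-word lexicon are illustrative, not part of the IIIT5K release:

```python
# Lexicon-constrained recognition sketch: pick the lexicon word most
# similar to the raw prediction.
from difflib import SequenceMatcher

def constrain_to_lexicon(prediction: str, lexicon: list[str]) -> str:
    """Return the lexicon word closest to the raw prediction."""
    return max(
        lexicon,
        key=lambda w: SequenceMatcher(None, prediction.lower(), w.lower()).ratio(),
    )

lexicon = ["house", "horse", "mouse"]          # toy 3-word lexicon
print(constrain_to_lexicon("mou5e", lexicon))  # -> "mouse"
```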
4. SVT
The SVT (Street View Text) dataset contains 350 images: 100 for training and 250 for testing. Some images are severely corrupted by noise, blur, or low resolution. Each image is associated with a 50-word lexicon.
5. CUTE80
The CUTE80 dataset contains 80 high-resolution images with 288 cropped text instances. It focuses on curved text recognition. Most images in CUTE80 have a complex background, perspective distortion, and poor resolution. No lexicon is associated with CUTE80.
6. SVHN
The SVHN dataset contains more than 600,000 digit images of house numbers in natural scenes. It was obtained from a large number of street view images using a combination of automated algorithms and the Amazon Mechanical Turk (AMT) framework. The SVHN dataset is typically used for scene digit recognition.
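Since SVHN ships with torchvision, the counts above are easy to sanity-check; the download path and transform below are arbitrary choices:

```python
# Load the SVHN train split through torchvision and inspect one sample.
from torchvision import datasets, transforms

svhn_train = datasets.SVHN(
    root="./data",
    split="train",                  # also available: "test" and "extra"
    transform=transforms.ToTensor(),
    download=True,
)
image, label = svhn_train[0]        # 3x32x32 tensor, digit label 0-9
print(len(svhn_train), image.shape, label)
```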
7. RCTW-17
The RCTW-17 dataset contains 12,514 images: 11,514 for training and 1,000 for testing. Most are natural images collected with cameras or mobile phones, while the rest are born-digital. Text instances are annotated with labels, fonts, languages, and other attributes.
8. MLT (ICDAR 2019 MLT competition)
The MLT-2019 dataset contains 20,000 images: 10,000 for training (1,000 per language) and 10,000 for testing. The dataset covers ten languages (Arabic, Bangla, Chinese, English, French, German, Hindi, Italian, Japanese, and Korean) representing seven different scripts. The number of images per script is equal.
In the development of artificial intelligence, nothing can substitute for high-quality datasets. For AI models to better understand and predict human behavior, ensuring the integrity and diversity of data must be a prime mission. By promoting data sharing and data standardization, companies and research institutions can together accelerate the maturity and adoption of AI technologies.