From: Nexdata Date: 2024-08-15
As machine learning technology becomes widespread, the importance of data has become clear. Datasets not only provide the foundation for the architecture of AI systems but also determine the breadth and depth of their applications. From anti-spoofing and facial recognition to autonomous driving, the collection and processing of perception data have become prerequisites for technological breakthroughs. As a result, high-quality data sources are becoming an important asset for market competitiveness.
Optical character recognition (OCR) is the task in which electronic devices such as scanners or digital cameras examine the characters in an image and translate their shapes into computer text using character recognition methods.
OCR has a wide range of application scenarios, including natural scenes, handwriting, and document recognition. Natural scene OCR is one of the most widely used OCR tasks and has huge market demand, because it is part of people’s daily lives. The text carriers are typically store plaques, stop signs, posters, road signs, cartoons, manhole cover paintings, prompts, warnings, packaging instructions, menus, building signs, etc.
Natural Scene OCR Data Annotation
Depending on the labeling granularity, annotation is usually divided into text line-level labeling and character-level labeling (word-level labeling is also performed if the text contains words in Latin-script languages). The labeling method is usually a text box plus a character transcription. Depending on the task requirements, the text box can be a rectangle or a quadrilateral.
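As a rough illustration of the "text box + character transcription" method described above, the sketch below models one annotation record in Python. The schema (the TextAnnotation class, its field names, and the clockwise-from-top-left corner order) is a hypothetical assumption made for illustration and is not Nexdata's actual delivery format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image pixels


@dataclass
class TextAnnotation:
    """One labeled text region (hypothetical schema, for illustration only)."""
    level: str          # "line", "word", or "character"
    quad: List[Point]   # four corners, assumed clockwise from top-left
    transcription: str  # the text contained in the box


def rect_to_quad(x: float, y: float, width: float, height: float) -> List[Point]:
    """Express an axis-aligned rectangle as a four-point quadrilateral."""
    return [(x, y), (x + width, y), (x + width, y + height), (x, y + height)]


def is_axis_aligned(quad: List[Point]) -> bool:
    """Return True if the quadrilateral is an upright rectangle."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = quad
    return y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0


# A rectangular box is enough for text photographed head-on...
sign = TextAnnotation("line", rect_to_quad(10, 20, 100, 30), "STOP")
# ...while tilted text needs a general quadrilateral.
tilted = TextAnnotation("line", [(10, 20), (110, 35), (105, 65), (5, 50)], "OPEN")
```

Storing every box as four points lets rectangular and quadrilateral annotations share one representation, since a rectangle is simply the special case in which the edges are parallel to the image axes.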
Challenges of Natural Scene OCR Tasks
From a technical point of view, natural scene OCR poses the following four difficulties:
● Language
Different countries use different languages, and character shapes vary greatly from one language to another, which makes recognition harder for OCR algorithms.
● Complicated Font
In natural scenes, text is often set in artistic fonts, whose appearance differs considerably from that of standard fonts. In addition, varying text sizes and colors further increase the difficulty of OCR tasks.
● Various Shooting Angles
Most users shoot text with mobile phones. Different users have different shooting habits, which leads to a wide range of shooting angles and challenges the robustness of OCR algorithms to tilted text.
● Diverse Character Carriers
Text carriers in natural scenes are highly varied, and some carriers distort the text. For example, food packaging is often deformed, which bends the text and increases the difficulty of OCR tasks.
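To make the shooting-angle difficulty above more concrete, the following minimal sketch estimates how far a labeled text line is tilted, using the top edge of its quadrilateral box. The function name text_tilt_degrees and the corner order (top-left first, then top-right) are assumptions made for illustration; real datasets may order corners differently.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image pixels


def text_tilt_degrees(quad: List[Point]) -> float:
    """Angle of the box's top edge relative to the horizontal image axis.

    Assumes quad[0] is the top-left corner and quad[1] the top-right corner.
    Image y coordinates grow downward, so a positive angle means the text
    appears rotated clockwise on screen.
    """
    (x0, y0), (x1, y1) = quad[0], quad[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


level_quad = [(0, 0), (100, 0), (100, 30), (0, 30)]
tilted_quad = [(0, 0), (10, 10), (0, 20), (-10, 10)]

level_angle = text_tilt_degrees(level_quad)
tilted_angle = text_tilt_degrees(tilted_quad)
```

A measure like this could be used to check that a dataset covers a broad range of tilt angles. It does not capture the bending caused by deformed carriers such as food packaging, which a single straight edge cannot describe.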
To address the needs and difficulties of OCR tasks in natural scenes, Nexdata has developed a series of datasets covering multiple languages, scenes, and shooting angles.
Natural Scenes OCR Data of 12 Languages
The data covers 12 languages (6 Asian languages and 6 European languages), multiple natural scenes, and multiple photographic angles. The text is annotated with line-level quadrilateral bounding boxes and transcriptions.
English OCR Data in Natural Scenes
The data was collected from real scenes in Britain and the United States. Its diversity covers multiple scenes, multiple photographic angles, and multiple light conditions. Annotation uses line-level, word-level, and character-level rectangular or quadrilateral bounding boxes, together with text transcription.
Handwriting OCR Data of Japanese and Korean
This dataset was collected from 100 subjects, including 50 Japanese, 49 Koreans and 1 Afghan. The corpus differs from subject to subject. The data diversity includes multiple cellphone models and different corpora.
Hindi OCR Images Data — Images with Annotation and Transcription
The data includes 2,056 images of natural scenes, 1,103 Internet images and 347 document images. For line-level content, line-level quadrilateral bounding box annotation and text transcription were adopted; for column-level content, column-level quadrilateral bounding box annotation and text transcription were adopted.
Vietnamese OCR Images Data — Images with Annotation and Transcription
The data includes 258 images of natural scenes, 2,553 Internet images and 2,184 document images. For line-level content, line-level quadrilateral bounding box annotation and text transcription were adopted; for column-level content, column-level quadrilateral bounding box annotation and text transcription were adopted.
If you would like more details about the datasets or how to acquire them, please feel free to contact us: info@nexdata.ai.
As more kinds of data are collected and annotated, how will AI technology gradually change our lives? The future of AI data is full of potential; let’s explore its possibilities together. If you have data requirements, please contact Nexdata.ai.