From: Nexdata Date: 2025-08-13
Google recently released Gemini 2.5 Pro, its next-generation reasoning model, which many consider one of the strongest models in the field of AI. It has also demonstrated strong OCR capabilities. Whether the input is complex handwriting, ancient documents, or multilingual bills, its near-zero-error recognition has attracted wide attention in the industry. How was this breakthrough achieved?
Large models typically require massive amounts of high-quality OCR training data for pre-training to achieve accurate recognition capabilities. Nexdata, with years of experience in the OCR field, has built a dataset of over 10 million OCR images, covering 50+ languages, multiple formats, and multiple scenarios. All of these images are manually annotated, providing critical data support for AI model training.
Natural Scene OCR Data
This dataset contains over one million natural scene images containing text, covering dozens of languages: Asian languages such as Japanese, Korean, Indonesian, and Malay; European languages such as French, German, Italian, and Portuguese; and the Southeast Asian languages Khmer (Cambodian), Lao, and Burmese. The images cover a variety of natural scenes, including slogans, posters, instruction manuals, and menus. They were captured with mobile phones, cameras, and scanners from multiple angles, including upward, downward, and level views. Capture, annotation, and text transcription accuracy all exceed 97%, making the dataset suitable for multilingual natural scene OCR tasks.
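The article does not say how transcription accuracy is measured. One common convention, assumed here purely for illustration, is character-level accuracy: one minus the character error rate, where the error rate is the edit distance between a reference transcription and the transcription being checked, divided by the reference length. The function names below are hypothetical and are not part of any Nexdata tooling.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - character error rate, clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    error_rate = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - error_rate)
```

Under this assumed definition, a transcription with one wrong character out of four scores 0.75, so a 97% threshold would allow roughly three character errors per hundred reference characters.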
Handwriting OCR Data
This dataset contains over 100,000 images of handwritten text in multiple languages and scenarios, covering Traditional Chinese, English, Japanese, Korean, Spanish, Portuguese, and French. The images were captured from various writing surfaces, including blackboards, whiteboards, greenboards, A4 paper, and lined paper, and from various angles, including level, downward, and upward views. They show a range of handwriting styles and content. Collection, annotation, and text transcription accuracy all exceed 97%, making the dataset suitable for handwriting OCR tasks.
OCR Data for Special-Shaped Text
This dataset contains over 50,000 images of Chinese special-shaped text, covering a variety of natural scenes (street scenes, plaques, billboards, posters, decorations, art, magazine covers), layouts (wavy, circular, etc.), and fonts. Text is annotated with polygonal and quadrilateral boxes drawn around semantic units and then transcribed. Annotation and text transcription accuracy exceed 97%, making the dataset suitable for special-shaped text OCR tasks.
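To make the difference between polygonal and quadrilateral boxes concrete, here is a minimal sketch of how one annotated text region might be represented. The `TextRegion` class and its field names are hypothetical. The article does not describe the dataset's actual file format, so this only illustrates the idea that a quadrilateral has four vertices while curved or wavy text needs more.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextRegion:
    """Hypothetical record: ordered vertices of the region plus its transcription."""
    points: List[Point]
    text: str

    def is_quadrilateral(self) -> bool:
        return len(self.points) == 4

    def area(self) -> float:
        """Area of the region via the shoelace formula (simple polygons only)."""
        total = 0.0
        n = len(self.points)
        for i in range(n):
            x1, y1 = self.points[i]
            x2, y2 = self.points[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Axis-aligned (x_min, y_min, x_max, y_max) enclosing the region."""
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return min(xs), min(ys), max(xs), max(ys)
```

A detector that only accepts rectangles could fall back to `bounding_box()`, at the cost of including background pixels that a tight polygon around curved text would exclude.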
Document OCR Data
This dataset of over 10 million documents contains a variety of document types, including manuals, office documents, historical works, and tables. It covers Chinese, English, Hindi, and other languages, in both PDF and image formats. To meet the requirements of complex-layout OCR, text is transcribed strictly according to its position on the page. Detection box annotation and text transcription accuracy exceed 95%, making the dataset suitable for document OCR tasks such as table detection and recognition, and article layout segmentation and analysis.
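Transcribing text according to its position implies some ordering of the detected boxes. The sketch below shows one simple way this could work, assuming a single-column page: boxes whose vertical centers are close are grouped into a line, and each line is read left to right. The `reading_order` function, the box tuple layout, and the `line_tolerance` parameter are all assumptions for illustration. The article does not describe the actual transcription rules, and multi-column or table layouts would need a more elaborate approach.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max, text)
Box = Tuple[float, float, float, float, str]


def vertical_center(box: Box) -> float:
    return (box[1] + box[3]) / 2


def reading_order(boxes: List[Box], line_tolerance: float = 0.5) -> List[str]:
    """Group boxes into lines by vertical center, then read each line left to right."""
    rows = []  # each entry: (reference center of the line, boxes on that line)
    for box in sorted(boxes, key=vertical_center):
        center = vertical_center(box)
        height = box[3] - box[1]
        # Same line if the center is within a fraction of this box's height
        # of the line's reference center (the first box placed on that line).
        if rows and abs(center - rows[-1][0]) <= line_tolerance * height:
            rows[-1][1].append(box)
        else:
            rows.append((center, [box]))
    ordered: List[str] = []
    for _, row in rows:
        ordered.extend(b[4] for b in sorted(row, key=lambda b: b[0]))
    return ordered
```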
Bill OCR Data
This dataset contains hundreds of thousands of bill images from multiple countries, including Arabia, Mexico, Brazil, and India, primarily in Arabic, Portuguese, Spanish, and English. It covers a variety of bill types. Text in each image is transcribed according to its original layout, with priority given to line alignment, and personal information is desensitized. The dataset can be used for bill recognition and text recognition.
Question-Answer OCR Data
This dataset contains over 20,000 sets of Chinese question-answer OCR data, encompassing a variety of scenes, including billboards, posters, handwritten newspapers, and street scenes, with various layouts and fonts. Each image contains a question-answer pair, and the answer is annotated with a polygonal box within the image. The annotation accuracy, text transcription accuracy, and answer accuracy all exceed 97%. This data provides a rich resource for large multimodal models. Validated by multiple AI companies, it helps models perform well in real-world applications.
Exam Question OCR Data
This dataset of nearly 60,000 exam questions covers subjects from K12 to university level. Question types include multiple-choice, fill-in-the-blank, short-answer, and solutions, along with any illustrations included in the answers. The questions were captured with mobile phones and scanners. Question stems, options, answers, and accompanying images are annotated with rectangular boxes and transcribed, and formulas and tables are transcribed in LaTeX format. Question type collection and classification accuracy is at least 97%, making the dataset suitable for intelligent grading and homework tutoring.
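As an illustration of what LaTeX transcription of formulas and tables can look like, here is a made-up example. Both the question and the table values are hypothetical and are not taken from the dataset. The dataset's exact markup conventions are not stated in the article.

```latex
% Hypothetical question stem with an inline formula
Solve for $x$: $x^{2} - 5x + 6 = 0$.

% Hypothetical table from a question, transcribed as a tabular environment
\begin{tabular}{|c|c|c|}
\hline
$x$ & 1 & 2 \\
\hline
$y$ & 3 & 5 \\
\hline
\end{tabular}
```

Transcribing into markup like this, rather than as flat text, preserves exponents, fractions, and cell boundaries that a plain character sequence would lose.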
Continuous breakthroughs in OCR technology depend on a steady supply of high-quality data. High-quality datasets are the cornerstone of AI development, and their importance cannot be neglected, whether for current applications or for future development. As AI is applied more deeply across industries, we have reason to believe that, with constantly improving datasets, future intelligent systems will become more efficient, smarter, and more secure.