Improve OCR efficiency with a data labeling platform

From: Nexdata Date: 08/15/2024

➤ OCR for insurance document processing

The quality and diversity of datasets determine how capable an AI model can be. Whether the model is used for smart security, autonomous driving, or human-machine interaction, the accuracy of its datasets directly affects its performance. As data collection technology develops, customized datasets of all types are constantly being created to support the optimization of AI algorithms. Through in-depth research on these datasets, the application prospects of AI technology will become broader.

Bills and other paper documents are everywhere in daily life, and some of them are quite important. If not kept carefully, they are easily lost or damaged, which can cause serious trouble. In the information era, the ways of managing paper documents such as bills and forms keep evolving, and electronic management has become the mainstream.

In the past, digitizing bills, forms, and other paper documents relied entirely on manual input, which was inefficient and error-prone and required enormous manpower and material resources. Moreover, manually entered data cannot be applied to AI algorithms.

➤ Shujiajia Pro data processing

Take the insurance industry as an example. By the end of 2018, the total annual income of China’s insurance market had reached 3.8 trillion yuan. The market has maintained a high growth rate over the past ten years, and its current growth rate remains at about 10%. This growth has produced a large number of insurance documents. In 2017, there were approximately 5.1 billion insurance documents in China. Assuming an average annual growth rate of 10%, the number was expected to reach about 7.5 billion in 2021. In the near future, the annual volume is expected to reach the tens of billions.
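The 2021 estimate above follows from simple compound growth. As a quick check, here is a minimal sketch that uses only the figures quoted in the paragraph (5.1 billion documents in 2017, 10% annual growth); the function name is just an illustrative choice.

```python
def project_documents(base_count: float, annual_growth: float, years: int) -> float:
    """Compound a starting count forward at a fixed annual growth rate."""
    return base_count * (1 + annual_growth) ** years


# Figures quoted in the article: 5.1 billion documents in 2017, about 10% growth per year.
docs_2017 = 5.1e9
docs_2021 = project_documents(docs_2017, 0.10, 2021 - 2017)

print(f"Projected 2021 volume: {docs_2021 / 1e9:.2f} billion")
```

Four years of 10% growth multiply the 2017 figure by 1.1^4 = 1.4641, which gives roughly 7.47 billion and matches the rounded 7.5 billion in the text.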

With an OCR bill recognition system, we only need to photograph or scan a document, and the system automatically extracts the data on it. Nexdata provides an efficient insurance document processing solution, the Shujiajia Pro data labeling platform, which is used to build the core capabilities of the OCR bill recognition system. The system consists of four main modules: OCR pre-identification, manual management, data output, and model iteration. Together they form a “human in the loop” closed loop.
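To make the closed loop concrete, the four modules can be sketched as a small Python class. This is only an illustration of the idea, not Shujiajia Pro's actual architecture: the class, the stub engine, the stub annotator, and the stub retraining step are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OcrLine:
    text: str
    corrected: bool = False


# An "engine" maps an image path to recognized lines (hypothetical signature).
Engine = Callable[[str], List[OcrLine]]


@dataclass
class HumanInTheLoop:
    pre_identify: Engine
    review: Callable[[List[OcrLine]], List[OcrLine]]
    training_pool: List[List[OcrLine]] = field(default_factory=list)

    def process(self, image_path: str) -> List[OcrLine]:
        draft = self.pre_identify(image_path)  # module 1: OCR pre-identification
        final = self.review(draft)             # module 2: manual management
        self.training_pool.append(final)       # module 3: data output
        return final

    def iterate_model(self, retrain: Callable[[List[List[OcrLine]]], Engine]) -> None:
        # module 4: model iteration, the corrected data produces the next engine
        self.pre_identify = retrain(self.training_pool)
        self.training_pool = []


# Stubs standing in for a real OCR engine, a human annotator, and a training job.
def stub_engine(image_path: str) -> List[OcrLine]:
    return [OcrLine("1nvoice No. 123")]  # deliberate misread of "Invoice"


def stub_annotator(lines: List[OcrLine]) -> List[OcrLine]:
    return [OcrLine(l.text.replace("1nvoice", "Invoice"), corrected=True) for l in lines]


def stub_retrain(pool: List[List[OcrLine]]) -> Engine:
    learned = pool[-1]
    return lambda image_path: [OcrLine(line.text) for line in learned]


loop = HumanInTheLoop(stub_engine, stub_annotator)
first = loop.process("policy_001.jpg")
loop.iterate_model(stub_retrain)
second = loop.pre_identify("policy_002.jpg")
```

The point of the sketch is the data flow: every human correction ends up in the training pool, and the next version of the engine is built from that pool, which is what closes the loop.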

Shujiajia Pro is a data processing platform that Nexdata developed from years of experience. It includes template tools refined through years of real production work, a quality management process for data labeling, data processing, and online pre-identification capabilities.

Shujiajia Pro: A data labeling expert

Built on an OCR recognition engine, Shujiajia Pro supports an OCR pre-identification service (line-level detection plus text transcription). Pre-identification accuracy reaches 90% on images with clear fonts and no large tilt angle.

As the dataset is continuously updated and the algorithm is iterated, the algorithm’s performance continues to improve. Shujiajia Pro can also be switched flexibly to the customer’s own pre-identification engine. The system and the customer’s engine are loosely coupled through a plugin: customers only need to build a Docker image that follows the plugin specification and upload it to the system.
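The plugin specification itself is not described here, so the following is only an in-process analogy of what "loosely coupled" means: the platform depends on a small interface, and any engine that satisfies it can be swapped in. All names (PreIdentificationEngine, EngineRegistry, recognize) are hypothetical, and the real mechanism uses a Docker image rather than Python classes.

```python
from typing import Dict, List, Optional, Protocol


class PreIdentificationEngine(Protocol):
    """Hypothetical contract: image bytes in, line-level boxes and text out."""

    def recognize(self, image_bytes: bytes) -> List[dict]:
        ...


class BuiltInEngine:
    def recognize(self, image_bytes: bytes) -> List[dict]:
        return [{"box": [0, 0, 100, 20], "text": "built-in result"}]


class CustomerEngine:
    def recognize(self, image_bytes: bytes) -> List[dict]:
        return [{"box": [0, 0, 100, 20], "text": "customer result"}]


class EngineRegistry:
    """Keeps the platform code independent of any concrete engine."""

    def __init__(self) -> None:
        self._engines: Dict[str, PreIdentificationEngine] = {}
        self._active: Optional[str] = None

    def register(self, name: str, engine: PreIdentificationEngine) -> None:
        self._engines[name] = engine
        if self._active is None:
            self._active = name  # the first registered engine is the default

    def switch(self, name: str) -> None:
        if name not in self._engines:
            raise KeyError(f"unknown engine: {name}")
        self._active = name

    def recognize(self, image_bytes: bytes) -> List[dict]:
        if self._active is None:
            raise RuntimeError("no engine registered")
        return self._engines[self._active].recognize(image_bytes)


registry = EngineRegistry()
registry.register("built_in", BuiltInEngine())
registry.register("customer", CustomerEngine())

before = registry.recognize(b"")
registry.switch("customer")
after = registry.recognize(b"")
```

Because callers only ever talk to the registry, switching engines does not require any change to the labeling workflow around it.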

Case 1: Value-added tax invoice

After pre-identification, the results are displayed to the annotator in the OCR template. The annotator corrects any errors made by the pre-identification system and then submits the data for quality inspection.

With the OCR pre-identification engine, labeling efficiency can be improved by about 30%.

➤ Nexdata's data processing and services

Case 2: Outpatient fee receipt

Nexdata’s OCR pre-identification technology can process many types of bills and forms, such as invoices, outpatient fee receipts, taxi bills, insurance documents, hospital records, and auto insurance bills.

After pre-identification and manual error correction, the data is submitted to the quality inspector. The quality inspector marks errors at the whole-image level and at the label level, gives the reason for each error, and returns the data to the annotator for repair. Many error types are built into the system, such as frame error, label object mismatch, and wrong label and attribute. The system also lets project managers define custom error types for a project.
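The quality inspection step can be pictured as a small data model. The sketch below is an assumption about how such a review might be represented, not the platform's real schema: the identifiers are made up, and only the three built-in error types named above are included.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# The three built-in error types named in the text (identifiers are illustrative).
BUILT_IN_ERROR_TYPES = {"frame_error", "label_object_mismatch", "wrong_label_and_attribute"}


@dataclass
class QaFinding:
    error_type: str
    reason: str
    label_id: Optional[str] = None  # None means the error applies to the entire image


class QaReview:
    def __init__(self, custom_error_types: Iterable[str] = ()) -> None:
        # Project managers can extend the built-in list per project.
        self.allowed = BUILT_IN_ERROR_TYPES | set(custom_error_types)
        self.findings: List[QaFinding] = []

    def flag(self, error_type: str, reason: str, label_id: Optional[str] = None) -> None:
        if error_type not in self.allowed:
            raise ValueError(f"unknown error type: {error_type}")
        self.findings.append(QaFinding(error_type, reason, label_id))

    def needs_rework(self) -> bool:
        # Any finding sends the item back to the annotator for repair.
        return bool(self.findings)


review = QaReview(custom_error_types=["stamp_covers_text"])
review.flag("frame_error", "box cuts off the last digit", label_id="line_07")
review.flag("stamp_covers_text", "seal hides the total amount")  # entire-image error
```

Keeping the reason next to each finding gives the annotator what they need to repair the item without guessing what the inspector meant.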

The result data is output in JSON format. To meet different customer needs, we provide a variety of online format conversion programs, for example to Pascal VOC (.xml) and Labelme (.json). The converted data can be imported into a data platform or assembled into a standard AI dataset for algorithm iteration.
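As an illustration of such a conversion, here is a minimal sketch that turns a JSON record into a Pascal VOC style XML annotation with Python's standard library. The input schema (image, width, height, lines, label, box) is an assumption made for this example, since the platform's actual JSON layout is not shown here.

```python
import json
import xml.etree.ElementTree as ET


def to_pascal_voc(record: dict) -> str:
    """Convert one record (hypothetical schema) to a Pascal VOC style XML string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = record["image"]

    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(record["width"])
    ET.SubElement(size, "height").text = str(record["height"])
    ET.SubElement(size, "depth").text = str(record.get("depth", 3))  # assume RGB if absent

    for line in record["lines"]:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = line["label"]
        bndbox = ET.SubElement(obj, "bndbox")
        # box is assumed to be [xmin, ymin, xmax, ymax] in pixels
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), line["box"]):
            ET.SubElement(bndbox, tag).text = str(value)

    return ET.tostring(root, encoding="unicode")


raw = """
{
  "image": "receipt_001.jpg",
  "width": 1240,
  "height": 1754,
  "lines": [
    {"label": "invoice_number", "text": "No. 123", "box": [100, 80, 420, 120]}
  ]
}
"""
voc_xml = to_pascal_voc(json.loads(raw))
print(voc_xml)
```

Note that Pascal VOC has no standard field for transcribed text, so this sketch keeps only the label and the bounding box; a real converter would have to decide where the text goes.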

End

Nexdata provides a complete and efficient bill management solution through private deployment of the Shujiajia Pro platform, which also ensures the privacy and security of customers’ data.

If you need data services, please feel free to contact us: [email protected]

With the advancement of data technology, we are heading toward a more intelligent world. The diversity and high-quality annotation of datasets will continue to promote the development of AI systems, create greater social benefits in fields such as healthcare, smart cities, and education, and bring technology and human well-being closer together.