Development of a new experimental method for evaluating OCR tools for the task of digital document classification

DOI:

https://doi.org/10.17308/sait/1995-5499/2024/3/114-126

Keywords:

OCR tools, text recognition, text classification, experimental evaluation, digital documents

Abstract

This paper presents an experimental method for evaluating existing OCR tools, motivated by the presence of scanned documents in datasets used for text classification. Before classification, scanned documents and other documents whose text cannot be retrieved by text extraction software must be converted into machine-readable text, which is done with optical character recognition (OCR). The purpose of the paper is to compare existing OCR tools experimentally by the quality with which they convert scanned documents into text. A tool was selected if it met three criteria: it is freely distributable, has built-in support for the Russian language, and is under active development. Three tools fit these criteria: Tesseract, EasyOCR and PaddleOCR. For the experiment, a corpus of digital documents was compiled, half of which were scanned documents. The documents were taken from open sources: 4 of the 6 classes consist of documents related to studying at higher education institutions of the Russian Federation, and the other 2 consist of public procurement documents (contracts and technical specifications). The experimental design included training the Longformer classifier, a transformer for processing long documents, on the datasets produced by each of the three OCR tools. The OCR tools were then evaluated by the quality of text classification achieved by Longformer. The experiment showed that Tesseract OCR achieved the highest text recognition accuracy of the three tools, which influenced the resulting classification accuracy on the text extracted from the documents.
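The evaluation loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `evaluate_ocr_tools`, `OcrFn` and `TrainFn` are hypothetical, the OCR tools and the classifier are passed in as plain callables, and accuracy is assumed as the classification metric.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type aliases introduced for this sketch only.
OcrFn = Callable[[str], str]  # document path -> recognized text
# (texts, labels) -> trained classifier mapping a text to a predicted label
TrainFn = Callable[[Sequence[str], Sequence[str]], Callable[[str], str]]
LabeledDocs = List[Tuple[str, str]]  # (document path, class label)


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Share of documents whose predicted class matches the true class."""
    if not expected:
        raise ValueError("expected labels must not be empty")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def evaluate_ocr_tools(
    ocr_tools: Dict[str, OcrFn],
    train_classifier: TrainFn,
    train_docs: LabeledDocs,
    test_docs: LabeledDocs,
) -> Dict[str, float]:
    """Score each OCR tool by the accuracy of a classifier trained on its output."""
    scores: Dict[str, float] = {}
    for name, ocr in ocr_tools.items():
        # Build one text dataset per OCR tool from the same source documents.
        train_texts = [ocr(path) for path, _ in train_docs]
        train_labels = [label for _, label in train_docs]
        # Train a fresh classifier on this tool's text only.
        classifier = train_classifier(train_texts, train_labels)
        # Test documents go through the same OCR tool before prediction.
        predictions = [classifier(ocr(path)) for path, _ in test_docs]
        scores[name] = accuracy(predictions, [label for _, label in test_docs])
    return scores
```

In a setup like the one in the paper, `ocr_tools` would hold wrappers around Tesseract, EasyOCR and PaddleOCR, and `train_classifier` would fine-tune Longformer. Keeping the corpus and the classifier fixed means that differences between the scores can be attributed to the OCR step.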

Author Biographies

  • Alla G. Kravets, Volgograd State Technical University

    Doctor of Technical Sciences, Professor, Professor of the Department «Systems of computer-aided design and search design», Volgograd State Technical University

  • Dmitry O. Semenochkin, Volgograd State Technical University

    Second-year master’s student of the Department «Systems of computer-aided design and search design», Volgograd State Technical University

  • Andrey K. Markov, Volgograd State Technical University

    Postgraduate student of the Department «Systems of computer-aided design and search design», Volgograd State Technical University

Published

2024-11-14

Section

Intelligent Information Systems, Data Analysis and Machine Learning

How to Cite

Development of a new experimental method for evaluating OCR tools for the task of digital document classification. (2024). Proceedings of Voronezh State University. Series: Systems Analysis and Information Technologies, 3, 114-126. https://doi.org/10.17308/sait/1995-5499/2024/3/114-126