Development of a new experimental method for evaluating OCR tools for the task of classifying digital documents

Alla Grigoryevna Kravets; Dmitry Olegovich Semenochkin; Andrey Konstantinovich Markov

doi:10.17308/sait/1995-5499/2024/3/114-126

Authors

Alla G. Kravets Volgograd State Technical University https://orcid.org/0000-0003-1675-8652
Dmitry O. Semenochkin Volgograd State Technical University https://orcid.org/0009-0008-2352-4313
Andrey K. Markov Volgograd State Technical University https://orcid.org/0009-0001-6452-0502

DOI:

https://doi.org/10.17308/sait/1995-5499/2024/3/114-126

Keywords:

OCR tools, text recognition, text classification, experimental evaluation, digital documents

Abstract

The paper presents an experimental method for evaluating existing OCR tools, motivated by the presence of scanned documents in datasets used for text classification. Before documents can be classified, scanned documents and documents whose text cannot be retrieved by text extraction software must be converted into machine-readable text, and optical character recognition (OCR) is used for this conversion. The purpose of the paper is to experimentally compare existing OCR tools by the quality with which they convert scanned documents into text. An OCR tool was considered if it was freely distributable, had built-in support for the Russian language, and was an actively developed project. Three tools met these criteria: Tesseract, EasyOCR and PaddleOCR. For the experiment, a corpus of digital documents, half of them scanned, was compiled from open sources. Of the six classes represented, four were documents related to the process of studying at higher education institutions of the Russian Federation, and the other two were public procurement documents: contracts and technical specifications. The experimental design involved training the Longformer classifier, a transformer for processing long documents, on the datasets produced by the three OCR tools. The tools were then evaluated by the quality of text classification that Longformer achieved. The results showed that Tesseract OCR was superior in text recognition accuracy, which in turn affected the classification accuracy obtained on the text extracted from the documents.
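The evaluation idea in the abstract, ranking OCR tools by the classification quality obtained on the text each one produces, can be illustrated with a minimal sketch. It assumes that a classifier has already been trained and applied once per OCR tool, so that only the scoring and ranking step is shown. The function names, the tool identifiers (`ocr_tool_a` and so on), the `syllabus` class label, and all label values are hypothetical placeholders, not data or results from the paper, and plain accuracy stands in for whatever classification quality metric the authors used.

```python
from typing import Dict, List, Tuple


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of documents whose predicted class matches the gold class."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def rank_ocr_tools(
    predictions_by_tool: Dict[str, List[str]], gold: List[str]
) -> List[Tuple[str, float]]:
    """Score each tool's downstream predictions and sort best first."""
    scores = {
        tool: accuracy(preds, gold)
        for tool, preds in predictions_by_tool.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy placeholder labels (hypothetical, NOT results from the paper).
gold_labels = ["contract", "specification", "contract", "syllabus"]

# Hypothetical classifier outputs, one list per OCR-produced dataset.
toy_predictions = {
    "ocr_tool_a": ["contract", "specification", "contract", "syllabus"],
    "ocr_tool_b": ["contract", "contract", "contract", "syllabus"],
    "ocr_tool_c": ["syllabus", "contract", "contract", "syllabus"],
}

ranking = rank_ocr_tools(toy_predictions, gold_labels)
for tool, score in ranking:
    print(f"{tool}: {score:.2f}")
```

Because the classifier and the test documents are held fixed across tools in this setup, differences in the scores can be attributed to the quality of the extracted text rather than to the classifier itself.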

Author Biographies

Alla G. Kravets, Volgograd State Technical University

Doctor of Technical Sciences, Professor, Professor of the department «Systems of computer-aided design and search design», Volgograd State Technical University
Dmitry O. Semenochkin, Volgograd State Technical University

Second-year master’s student of the department «Systems of computer-aided design and search design», Volgograd State Technical University
Andrey K. Markov, Volgograd State Technical University

Postgraduate student of the department «Systems of computer-aided design and search design», Volgograd State Technical University