Development of a new experimental method for evaluating OCR tools for the task of digital document classification
DOI:
https://doi.org/10.17308/sait/1995-5499/2024/3/114-126
Keywords:
OCR tools, text recognition, text classification, experimental evaluation, digital documents
Abstract
The paper describes an experimental method for evaluating existing OCR tools, motivated by the presence of scanned documents in datasets used for text classification. Before they can be classified, scanned documents and documents whose text cannot be retrieved by text extraction software must be converted into machine-readable text; optical character recognition (OCR) is used for this. The purpose of the paper is to compare existing OCR tools experimentally by the quality with which they convert scanned documents into text. An OCR tool had to meet three selection criteria: it must be freely distributable, have built-in support for the Russian language, and be an actively developed project. Three tools met these criteria: Tesseract, EasyOCR and PaddleOCR. For the experiment, a corpus of digital documents was compiled, half of which were scanned documents. The documents were taken from open sources: 4 of the 6 classes were documents related to the educational process in higher education institutions of the Russian Federation, and the other 2 were public procurement documents (contracts and technical specifications). The experimental design consisted of training the Longformer classifier, a transformer for processing long documents, on datasets produced by the three OCR tools. The OCR tools were evaluated by the classification quality that Longformer achieved. The results showed that Tesseract OCR achieved the highest text recognition accuracy, which in turn affected the classification accuracy on the text extracted from the documents.
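The evaluation design summarized in the abstract (run each OCR tool over the same corpus, train a classifier on each resulting dataset, compare classification quality) can be sketched roughly as follows. This is only an illustrative outline, not the authors' code: the bag-of-words classifier is a toy stand-in for Longformer, `clean_ocr` and `noisy_ocr` are hypothetical placeholders for real OCR tools such as Tesseract, EasyOCR or PaddleOCR, and the tiny corpus is invented for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical data format: (raw document content, class label).
Document = Tuple[str, str]
OcrFn = Callable[[str], str]


def train_bow_classifier(samples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Toy stand-in for Longformer: collect per-class word counts."""
    model: Dict[str, Counter] = {}
    for text, label in samples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def predict(model: Dict[str, Counter], text: str) -> str:
    """Pick the class whose training text shares the most words with `text`."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic (first label alphabetically).
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


def evaluate_ocr_tools(
    ocr_tools: Dict[str, OcrFn],
    train_docs: List[Document],
    test_docs: List[Document],
) -> Dict[str, float]:
    """Train one classifier per OCR tool and report its test accuracy."""
    results: Dict[str, float] = {}
    for name, ocr in ocr_tools.items():
        train = [(ocr(raw), label) for raw, label in train_docs]
        model = train_bow_classifier(train)
        correct = sum(predict(model, ocr(raw)) == label for raw, label in test_docs)
        results[name] = correct / len(test_docs)
    return results


def clean_ocr(raw: str) -> str:
    """Hypothetical perfect OCR: returns the text unchanged."""
    return raw


def noisy_ocr(raw: str) -> str:
    """Hypothetical poor OCR: corrupts every third character."""
    return "".join("?" if i % 3 == 0 else ch for i, ch in enumerate(raw))


# Invented miniature corpus, for illustration only.
train_docs = [
    ("contract supplier payment terms", "contract"),
    ("syllabus course lecture exam", "syllabus"),
]
test_docs = [
    ("payment contract supplier", "contract"),
    ("exam lecture course", "syllabus"),
]

results = evaluate_ocr_tools(
    {"clean": clean_ocr, "noisy": noisy_ocr}, train_docs, test_docs
)
for tool, accuracy in results.items():
    print(f"{tool}: accuracy = {accuracy:.2f}")
```

In a real experiment the placeholder OCR functions would wrap the actual tools and the stand-in classifier would be replaced by a fine-tuned Longformer. The comparison loop keeps the same shape: because the corpus and classifier are held fixed, differences in classification quality can be attributed to the OCR step.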