DEVELOPMENT OF AN EXPLAINABLE FUZZY TEXT MATCHING METHOD UNDER "COLD START" CONDITIONS WITH FEEDBACK
DOI: https://doi.org/10.17308/sait/1995-5499/2025/4/183-197

Keywords: text matching, fuzzy search, explainable machine learning, natural language processing, named entity recognition, word2vec, BERT

Abstract
This article addresses the problem of fuzzy text document matching, i.e., determining the degree of semantic similarity between documents. The task arises, for instance, when searching a corpus for documents similar to a given one; this work uses the selection of job vacancies that match course descriptions as a running example. The goal of this work is to develop an explainable fuzzy text document matching method that operates under "cold start" conditions (without a labeled dataset for initial training) and can improve through feedback. The method is based on comparing embeddings of keywords (or named entities) extracted from the texts and is supplemented with post-processing by a bi-encoder and a feedback-based learning mechanism; both additions filter out unsuitable documents. Unlike traditional token-based approaches, the method is trainable and takes semantic similarity into account, while unlike neural network approaches (comparing whole-text embeddings or using cross-encoders) it provides explainable results. An experimental evaluation of the method was conducted on a corpus of 691 job vacancies and 3860 course descriptions. Among the keyword extraction methods considered, named entity recognition (NER) models showed the best results, which corresponds to a larger number of extracted keywords per text. Using the NER model, word2vec for keyword embeddings, and LaBSE-ru-turbo as the bi-encoder, the method achieved an F1-score of 0.79, which exceeds both a plain bi-encoder comparison (F1 = 0.76) and the version of the method without feedback and the bi-encoder (F1 = 0.75).
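The core idea described above, comparing embeddings of extracted keywords and explaining a match through the keyword pairs that produced it, can be illustrated with a minimal Python sketch. The 3-dimensional vectors below are toy values standing in for word2vec embeddings, and all keyword names are hypothetical; the bi-encoder post-processing and feedback stages of the actual method are omitted here.

```python
from math import sqrt

# Toy keyword embeddings standing in for word2vec vectors (illustrative values only).
EMB = {
    "python":  (1.0, 0.0, 0.2),
    "sql":     (0.9, 0.1, 0.3),
    "ml":      (0.1, 1.0, 0.0),
    "cooking": (0.0, 0.1, 1.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def match(course_kw, vacancy_kw, threshold=0.8):
    """For each vacancy keyword, find the most similar course keyword.
    Returns (score, pairs): score is the share of vacancy keywords covered
    by a course keyword above the threshold, and pairs lists the matched
    (vacancy_kw, course_kw, similarity) triples that explain the result."""
    pairs = []
    for vk in vacancy_kw:
        best = max(course_kw, key=lambda ck: cosine(EMB[ck], EMB[vk]))
        pairs.append((vk, best, cosine(EMB[best], EMB[vk])))
    covered = [p for p in pairs if p[2] >= threshold]
    return len(covered) / len(vacancy_kw), covered

# A course teaching Python and ML matches a vacancy asking for SQL and ML:
# "sql" is covered by the semantically close "python", "ml" matches directly.
score, why = match(["python", "ml"], ["sql", "ml"])
```

Unlike comparing a single embedding of each whole text, the returned `why` list makes the decision inspectable: each accepted document comes with the concrete keyword pairs that justified it.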