Analysis of information criteria of relevant feature selection in text mining methods

DOI:

https://doi.org/10.17308/sait.2020.2/2924

Keywords:

text mining, feature selection methods, term frequency, collection of documents, criteria evaluation

Abstract

In this paper, document feature selection methods based on information theory are assessed quantitatively and qualitatively. The aim of the research was to verify the applicability of a number of criteria for reducing the set of terms in a collection of texts to which supervised and unsupervised classification methods would subsequently be applied. The input data for the implemented software was grouped by topic similarity and, depending on the experiment, comprised sets of 45 documents drawn from three categories of technical texts in varying proportions. The criteria were calculated with TextStageProcessor, an open-source software system for text mining. Two values were introduced to evaluate the performance of the criteria. The first is the relative number of documents that belong to the category and contain a specified term. The second is the relative number of documents that belong to the category and do not contain the specified term. Graphs of the dependence of these values on the criteria were constructed, and limitations on the specified parameters were considered. The results obtained for the MI, CHI, and IG criteria are not monotonic, which indicates that these criteria may be unsuitable for the input collection of documents and that further research is needed. For the second part of the experiment, the texts were preprocessed: stop words were removed, and terms were normalised and converted to lowercase. The qualitative shape of the graphs of the TFD, DF, and TF∙IDF criteria against word rank in the collection shows that these criteria can be used to reduce the set of relevant input terms for classification with no loss in the quality of the research.
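As an illustration of the criteria named in the abstract, the following is a minimal Python sketch, not the TextStageProcessor implementation. It assumes the standard textbook forms of MI (pointwise mutual information), CHI (chi-square), and IG (expected mutual information of a binary term/category pair), computed from a 2×2 document contingency table, together with DF and a plain TF∙IDF weight. All function names, the toy corpus, and the choice to normalise the two evaluation values by the total number of documents are assumptions made for this example; TFD is omitted because the abstract does not define it.

```python
import math

def contingency(docs, labels, term, category):
    """Count documents as (a, b, c, d):
    a: in category, contains term      b: not in category, contains term
    c: in category, lacks term         d: not in category, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square(a, b, c, d):
    """Chi-square statistic of term/category dependence (textbook form)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """Pointwise mutual information between term presence and category."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))

def information_gain(a, b, c, d):
    """Expected mutual information over the four cells of the table."""
    n = a + b + c + d
    cells = ((a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d))
    total = 0.0
    for joint, term_margin, cat_margin in cells:
        if joint:  # empty cells contribute nothing
            total += joint / n * math.log(joint * n / (term_margin * cat_margin))
    return total

def document_frequency(docs, term):
    """Number of documents that contain the term at least once."""
    return sum(term in tokens for tokens in docs)

def tf_idf(docs, term, doc_index):
    """Raw term count in one document times log(N / DF)."""
    df = document_frequency(docs, term)
    if df == 0:
        return 0.0
    return docs[doc_index].count(term) * math.log(len(docs) / df)

# Toy corpus, invented for illustration only (already tokenised, lowercased).
docs = [
    ["network", "router", "packet"],
    ["network", "protocol"],
    ["database", "query", "index"],
    ["database", "network"],
]
labels = ["net", "net", "db", "db"]

a, b, c, d = contingency(docs, labels, "network", "net")
n = a + b + c + d
share_with_term = a / n     # in category and contains the term
share_without_term = c / n  # in category and lacks the term
chi = chi_square(a, b, c, d)
mi = mutual_information(a, b, c, d)
ig = information_gain(a, b, c, d)
```

Plotting `chi`, `mi`, and `ig` against `share_with_term` and `share_without_term` over many terms would give curves of the kind the abstract describes, under the assumptions stated above.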

Author Biographies

  • Alexander L. Kalabin, Tver State Technical University

    DSc in Physics and Mathematics, Professor, Head of the Software Department, Tver State Technical University

  • Yelena I. Korneeva, Tver State Technical University

    Fourth-year postgraduate student, Software Department, Tver State Technical University

Published

2020-06-15

Section

Computer Linguistics and Natural Language Processing

How to Cite

Kalabin, A. L., & Korneeva, Y. I. (2020). Analysis of information criteria of relevant feature selection in text mining methods. Proceedings of Voronezh State University. Series: Systems Analysis and Information Technologies, 2, 150-159. https://doi.org/10.17308/sait.2020.2/2924