A study of the vocabulary of genre 2.0 texts using methods of quantitative and corpus linguistics (based on Instagram texts)

O. V. Donina

doi:10.17308/lic.2020.3/2928

Authors

O. V. Donina, Voronezh State University

DOI:

https://doi.org/10.17308/lic.2020.3/2928

Keywords:

Instagram, quantitative analysis, AntConc, stylometry, R, Voyant Tools, visualization, genre 2.0

Abstract

The article discusses how methods of corpus linguistics and quantitative linguistics can be applied to the study of genre 2.0 texts, using Instagram texts as an example. The research corpus comprised 43,000 words. All texts were divided into three groups according to the number of subscribers of their authors (100 thousand – 400 thousand; 500 thousand – 900 thousand; 1 million and more). The tasks addressed in the article were: 1) preliminary data processing (lemmatization and removal of stop words); 2) identification of keywords using AntConc; 3) data visualization with Voyant Tools; 4) clustering using the R language; 5) comparison of the obtained indicators by author and by the groups listed above. According to the research hypothesis, the authors should cluster into the selected groups according to their number of subscribers. Confirming the hypothesis would make it possible to develop an automatic classifier of Instagram texts. The most frequent words in the entire research corpus were identified: svoy (own), ochen (very), samyi (most), bolshoi (big), god (year). The frequencies of these words were compared across groups (the largest deviation was 0.26 %) and across authors within groups (where the largest deviation ranged from 0.5 % to 0.75 %). The multiple correlation coefficient also showed that the frequency distributions of words were more similar between groups (45 %) than between authors within groups (from 15 % to 35 %). Next, the 20 most frequent words of each group were compared and part-of-speech preferences were noted: half of the words in the first group were adjectives, while in the third group 45 % were nouns. The percentage of unique vocabulary was then calculated by group (74.9 %) and by author (70.6 %).
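The frequency comparison and the unique-vocabulary measure described above can be illustrated with a minimal Python sketch. It is not the author's procedure: the study used AntConc, Voyant Tools and R, and the abstract does not define "unique vocabulary" precisely. The sketch assumes it means word types that occur in only one group. The token lists, group labels and function names are hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical toy data: lemmatized, stop-word-free tokens per subscriber group.
# These tokens are invented for illustration and are not from the research corpus.
groups = {
    "100k-400k": ["svoy", "ochen", "god", "den", "svoy"],
    "500k-900k": ["ochen", "bolshoi", "god", "den"],
    "1m+": ["samyi", "god", "mir", "svoy"],
}


def relative_frequencies(tokens):
    """Return each word's share of the token list, in percent."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 100.0 * n / total for word, n in counts.items()}


def max_deviation(word, token_lists):
    """Largest difference in the relative frequency of `word` across the lists."""
    shares = [relative_frequencies(t).get(word, 0.0) for t in token_lists]
    return max(shares) - min(shares)


def unique_vocabulary_share(token_lists):
    """Percentage of word types that occur in exactly one of the lists.

    Assumption: this is only one possible reading of "unique vocabulary".
    """
    vocabularies = [set(t) for t in token_lists]
    occurrences = Counter(word for vocab in vocabularies for word in vocab)
    unique = sum(1 for n in occurrences.values() if n == 1)
    return 100.0 * unique / len(occurrences)


if __name__ == "__main__":
    token_lists = list(groups.values())
    print(max_deviation("god", token_lists))
    print(unique_vocabulary_share(token_lists))
```

The same two functions could be applied to authors within one group by passing per-author token lists instead of per-group lists.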
The last step applied stylometry, which likewise did not reveal groups corresponding to the number of subscribers. Although the research hypothesis was not confirmed and no statistically significant distinctive features of the groups (defined by the number of subscribers) were identified, a comprehensive toolkit for quantitative text analysis was proposed.
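As an illustration of what a stylometric comparison of authors can look like, the following sketch computes Burrows's Delta, a common stylometric distance. The abstract does not state which measure or R package was used, so this is a swapped-in technique under stated assumptions, not the method of the study. The author names, tokens and the function name `burrows_delta` are hypothetical, and the population standard deviation is used for the z-scores for simplicity.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy texts (already tokenized); author names and tokens are invented.
texts = {
    "author_a": ["svoy", "ochen", "god", "svoy", "den", "ochen"],
    "author_b": ["svoy", "ochen", "god", "svoy", "den", "god"],
    "author_c": ["mir", "god", "mir", "bolshoi", "mir", "den"],
}


def burrows_delta(texts, top_n=5):
    """Pairwise Burrows's Delta over the top_n most frequent words."""
    total = Counter(tok for toks in texts.values() for tok in toks)
    features = [w for w, _ in total.most_common(top_n)]
    # Relative frequency of every feature word in every text.
    rel = {
        name: [toks.count(w) / len(toks) for w in features]
        for name, toks in texts.items()
    }
    names = list(texts)
    # Standardize each feature across texts (z-scores).
    z = {name: [] for name in names}
    for i in range(len(features)):
        column = [rel[name][i] for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name in names:
            z[name].append((rel[name][i] - mu) / sigma if sigma else 0.0)
    # Delta is the mean absolute difference of z-scores for each pair of texts.
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    for pair, delta in sorted(burrows_delta(texts).items(), key=lambda kv: kv[1]):
        print(pair, round(delta, 3))
```

A smaller Delta means more similar word-frequency profiles. If authors with similar subscriber counts consistently had the smallest mutual distances, clustering on these distances would recover the groups, which is the pattern the study looked for and did not find.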