Russian-Language Thesauri: Automatic Construction and Application for Natural Language Processing Tasks

This paper surveys existing digital Russian-language thesauri and the methods for their automatic construction and application. The authors analyze the main characteristics of thesauri that are openly available for scientific research and evaluate both their development trends and their effectiveness in natural language processing tasks. They examine statistical and linguistic construction methods that automate thesaurus development and reduce the labor of expert linguists. In particular, they consider algorithms for extracting keywords and semantic thesaurus relations of all types, and assess the quality of the thesauri generated with these tools. To illustrate the features of different methods for constructing thesaurus relations, the authors developed a combined method that generates a specialized thesaurus fully automatically from a text corpus of a selected domain and several existing linguistic resources. The method was applied in experiments on two Russian-language text corpora from different domains: articles on migration and tweets. The resulting thesauri were analyzed using an integrated assessment that the authors developed in a previous study; it characterizes various aspects of a thesaurus and gauges the quality of the methods used to generate it. The analysis revealed the main advantages and disadvantages of the various approaches to thesaurus construction and to the extraction of different types of semantic relations, and identified potential directions for future research.
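As a rough illustration of the statistical family of methods mentioned above, the following minimal sketch derives association relations from document-level co-occurrence. It is not the authors' combined method: the function name `build_associations`, the whitespace tokenizer, and both thresholds are assumptions made purely for illustration, and a realistic pipeline for Russian would add lemmatization and stop-word filtering.

```python
from collections import Counter
from itertools import combinations


def build_associations(docs, min_doc_freq=2, min_cooccurrence=2):
    """Map each frequent term to the terms it co-occurs with often enough.

    Hypothetical helper for illustration only; thresholds are arbitrary.
    """
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [set(doc.lower().split()) for doc in docs]

    # Keep terms that appear in at least `min_doc_freq` documents.
    doc_freq = Counter(term for tokens in tokenized for term in tokens)
    terms = {t for t, n in doc_freq.items() if n >= min_doc_freq}

    # Count how many documents contain each unordered pair of kept terms.
    pair_counts = Counter()
    for tokens in tokenized:
        present = sorted(terms & tokens)
        pair_counts.update(combinations(present, 2))

    # Record a symmetric association for pairs above the threshold.
    related = {t: set() for t in terms}
    for (a, b), n in pair_counts.items():
        if n >= min_cooccurrence:
            related[a].add(b)
            related[b].add(a)

    return {t: sorted(r) for t, r in related.items() if r}
```

Calling `build_associations` on a list of document strings returns a dictionary from each retained term to its associated terms. Co-occurrence alone yields only untyped association links; hierarchical relations such as hypernymy generally need additional linguistic evidence, for example lexico-syntactic patterns or an existing resource.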
