Russian word sense induction by clustering averaged word embeddings

This paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Among all 19 participants, our team was ranked 2nd on the wiki-wiki dataset (containing mostly homonyms) and 5th on the bts-rnc and active-dict datasets (containing mostly polysemous words). The method we employed was extremely naive: it consisted of representing the contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups that correspond to the senses of the ambiguous word. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks such as word sense induction.
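The two steps described in the abstract (averaging the word vectors of each context, then clustering the averaged vectors) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three-dimensional toy vectors in `EMBEDDINGS` are invented rather than taken from a pre-trained model, English tokens stand in for lemmatized Russian text, a small hand-written k-means stands in for the unspecified "mainstream clustering techniques", and the number of senses is fixed at two.

```python
import math

# Hypothetical toy embeddings; a real run would load a pre-trained model instead.
EMBEDDINGS = {
    "river": [0.90, 0.10, 0.00],
    "water": [0.80, 0.20, 0.10],
    "shore": [0.85, 0.15, 0.05],
    "money": [0.10, 0.90, 0.00],
    "loan": [0.00, 0.95, 0.10],
    "account": [0.05, 0.85, 0.15],
}


def average_context(tokens, embeddings):
    """Return the mean vector of the known tokens, or None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Plain k-means with farthest-first initialisation (deterministic)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from its nearest already-chosen centroid.
        farthest = max(
            points, key=lambda p: min(euclidean(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Contexts of an ambiguous word such as "bank", with the target word removed.
contexts = [
    ["river", "water"],
    ["money", "loan"],
    ["shore", "river"],
    ["account", "money"],
]
context_vectors = [average_context(c, EMBEDDINGS) for c in contexts]
labels = kmeans(context_vectors, k=2)
print(labels)
```

Each label is an induced sense identifier: contexts that share a label are treated as instances of the same sense. In a setting where the number of senses is not known in advance, a clustering algorithm that does not require a fixed `k` would be the more natural choice.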
