A Dataset for Noun Compositionality Detection for a Slavic Language

This paper presents the first gold-standard resource for Russian annotated with compositionality information for noun compounds. The compound phrases are collected from the Universal Dependencies treebanks by matching part-of-speech patterns, such as ADJ+NOUN or NOUN+NOUN, against the gold-standard annotations. Each compound phrase is annotated by two experts and a moderator according to the following schema: the phrase is compositional, non-compositional, or ambiguous (i.e., depending on the context, it can be interpreted either as compositional or as non-compositional). We conduct an experimental evaluation of models and methods for predicting the compositionality of noun compounds in unsupervised and supervised setups. We show that methods from previous work, when evaluated on the proposed Russian-language resource, achieve performance comparable to results reported on English corpora.
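The candidate collection step described above can be illustrated with a minimal sketch. It assumes that candidates are adjacent token pairs whose UPOS tags match a pattern; the abstract does not say whether the authors match on adjacency or on dependency relations, so this is one plausible reading and not the authors' pipeline. The names `PATTERNS`, `read_sentences`, `extract_candidates`, and the sample sentence are hypothetical.

```python
from typing import Iterator, List, Tuple

# Assumed part-of-speech patterns, taken from the examples in the abstract.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

Token = Tuple[str, str, str]  # (form, lemma, upos)


def read_sentences(conllu: str) -> Iterator[List[Token]]:
    """Yield sentences as lists of (form, lemma, upos) from CoNLL-U text."""
    sentence: List[Token] = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence


def extract_candidates(conllu: str) -> List[Tuple[str, str, str]]:
    """Return (surface phrase, lemma pair, pattern) for adjacent matching tokens."""
    candidates = []
    for sent in read_sentences(conllu):
        for (f1, l1, p1), (f2, l2, p2) in zip(sent, sent[1:]):
            if (p1, p2) in PATTERNS:
                phrase = f"{f1.lower()} {f2.lower()}"
                candidates.append((phrase, f"{l1} {l2}", f"{p1}+{p2}"))
    return candidates


# Hypothetical sample; real CoNLL-U rows have ten tab-separated columns,
# of which only the first four (ID, FORM, LEMMA, UPOS) are used here.
rows = [
    ("1", "Железная", "железный", "ADJ"),
    ("2", "дорога", "дорога", "NOUN"),
    ("3", "закрыта", "закрыть", "VERB"),
    ("4", ".", ".", "PUNCT"),
]
SAMPLE = "# text = Железная дорога закрыта.\n" + "\n".join("\t".join(r) for r in rows) + "\n"

if __name__ == "__main__":
    for candidate in extract_candidates(SAMPLE):
        print(candidate)
```

Matching on gold UPOS tags keeps the sketch independent of any tagger. A dependency-based variant would instead check that the first token attaches to the second via a relation such as `amod` or `nmod`, which would also capture non-adjacent pairs.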
