Combining Different Features of Idiomaticity for the Automatic Classification of Noun+Verb Expressions in Basque

We present an experimental study of how different features help measure the idiomaticity of noun+verb (NV) expressions in Basque. After testing several techniques for quantifying the four basic properties of multiword expressions (MWEs), namely institutionalization, semantic non-compositionality, morphosyntactic fixedness and lexical fixedness, we test different combinations of them for classification into idioms and collocations, using Machine Learning (ML) and feature selection. The results show that distributional similarity, which measures compositionality, plays the major role in the extraction and classification of MWEs, especially, as expected, in the case of idioms. Although cooccurrence and some aspects of morphosyntactic flexibility contribute to this task to a more limited extent, the ML experiments benefit from these sources of knowledge, improving on the results obtained using distributional similarity features alone.
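To make the idea of distributional similarity as a compositionality measure concrete, the following is a minimal sketch and not the method used in the study. It assumes that an expression and each of its components are represented by bags of context words, and it scores compositionality as the average cosine similarity between the expression's contexts and those of its noun and verb. The function names, the averaging scheme and the toy counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def compositionality_score(expr_ctx: Counter, noun_ctx: Counter, verb_ctx: Counter) -> float:
    """Average similarity of the expression's contexts to those of its parts.

    Assumption: a low score suggests the expression is used in contexts
    unlike those of its components, i.e. it is less compositional.
    """
    return (cosine(expr_ctx, noun_ctx) + cosine(expr_ctx, verb_ctx)) / 2


# Invented context counts, for illustration only (not corpus data).
noun_ctx = Counter({"cat": 5, "pet": 4, "fur": 2})
verb_ctx = Counter({"release": 3, "door": 2, "reveal": 1})
idiom_like_ctx = Counter({"secret": 4, "reveal": 3, "surprise": 2})
literal_like_ctx = Counter({"pet": 3, "fur": 2, "door": 2, "release": 1})

idiom_score = compositionality_score(idiom_like_ctx, noun_ctx, verb_ctx)
literal_score = compositionality_score(literal_like_ctx, noun_ctx, verb_ctx)
```

With these toy vectors, the idiom-like expression shares almost no contexts with its components and therefore scores lower than the literal-like one. In a setup like the one the abstract describes, a score of this kind would be one feature among several (cooccurrence, morphosyntactic flexibility) passed to a classifier.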
