An evaluation of the role of statistical measures and frequency for MWE identification

We report on an experiment to evaluate the role of statistical association measures and frequency in the identification of multiword expressions (MWEs). We base our evaluation on a lexicon of 14,000 MWEs comprising different types of word combinations: collocations, nominal compounds, light verb + predicate combinations, idioms, etc. These MWEs were manually validated from a list of n-grams extracted from a 50-million-word corpus of Portuguese (a subcorpus of the Reference Corpus of Contemporary Portuguese), using several criteria: syntactic fixedness, idiomaticity, frequency and the Mutual Information (MI) measure, although no threshold was established for either group frequency or MI. We report on MWEs that were selected on the basis of their syntactic and semantic properties even though their MI, or both their MI and their frequency, show low values; such cases make it difficult to establish a cut-off point. We analyze the MI values of the MWEs in our gold dataset and, for some specific cases, compare these values with two other statistical measures.
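To make concrete why a single MI cut-off is hard to set, the sketch below computes MI for a word pair from raw counts. It assumes the common pointwise formulation, log2(N · f(x,y) / (f(x) · f(y))); the abstract does not state which variant was used, and the function name and all counts are hypothetical, chosen as powers of two so the arithmetic is easy to follow.

```python
import math


def pointwise_mi(pair_freq, freq_x, freq_y, corpus_size):
    """Pointwise mutual information of a word pair, from raw counts.

    Assumed formulation (not confirmed by the source):
        MI(x, y) = log2( N * f(x, y) / (f(x) * f(y)) )
    """
    if min(pair_freq, freq_x, freq_y, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(pair_freq * corpus_size / (freq_x * freq_y))


# Hypothetical corpus of 2**20 tokens (not the 50-million-word corpus itself).
N = 2 ** 20

# A rare pair whose words almost always occur together: high MI, low frequency.
rare_pair_mi = pointwise_mi(pair_freq=8, freq_x=16, freq_y=32, corpus_size=N)

# A frequent pair built from very common words: low MI, high frequency.
frequent_pair_mi = pointwise_mi(
    pair_freq=1024, freq_x=2 ** 16, freq_y=2 ** 15, corpus_size=N
)

print(rare_pair_mi, frequent_pair_mi)
```

With these illustrative counts, the pair seen only 8 times scores far higher than the pair seen 1024 times, so a threshold on MI alone, or on frequency alone, would keep one and discard the other regardless of their linguistic status.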
