Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice

We present a new model for acquiring comprehensive multiword lexicons from large corpora based on competition among n-gram candidates. In contrast to the standard approach of simply ranking candidates by an association measure, our model arranges n-grams in a lattice structure based on subsumption and overlap relationships, and a node that is selected as a lexical item inhibits other nodes in its vicinity. We show how the configuration of such a lattice can be optimized tractably, and demonstrate, using annotations of sampled n-grams, that our method consistently outperforms alternatives by at least 0.05 F-score across several corpora and languages.
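To make the idea of competition in a lattice concrete, the following is a minimal toy sketch, not the optimization procedure of the paper. It assumes that each candidate n-gram already carries a single score (the numbers below are invented for illustration), links candidates that stand in a subsumption or overlap relationship, and then uses a simple greedy pass in which a selected node inhibits all of its neighbours. The function names `subsumes`, `overlaps`, `build_lattice`, and `greedy_select` are hypothetical.

```python
from itertools import combinations


def subsumes(a, b):
    """True if n-gram b occurs as a contiguous subsequence of the longer n-gram a."""
    n, m = len(a), len(b)
    return m < n and any(a[i:i + m] == b for i in range(n - m + 1))


def overlaps(a, b):
    """True if a proper suffix of one n-gram equals a proper prefix of the other."""
    def suffix_prefix(x, y):
        return any(x[-k:] == y[:k] for k in range(1, min(len(x), len(y))))
    return suffix_prefix(a, b) or suffix_prefix(b, a)


def build_lattice(ngrams):
    """Link every pair of candidates related by subsumption or overlap."""
    neighbors = {g: set() for g in ngrams}
    for a, b in combinations(ngrams, 2):
        if subsumes(a, b) or subsumes(b, a) or overlaps(a, b):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def greedy_select(scores):
    """Greedy stand-in for lattice optimization (an assumption, not the paper's method):
    visit candidates from highest to lowest score and select one only if no
    already-selected neighbour inhibits it."""
    neighbors = build_lattice(list(scores))
    selected = set()
    for g in sorted(scores, key=scores.get, reverse=True):
        if not (neighbors[g] & selected):
            selected.add(g)
    return selected


# Invented scores, for illustration only.
scores = {
    ("kick", "the", "bucket"): 0.9,
    ("hot", "dog"): 0.8,
    ("bucket", "list"): 0.7,
    ("the", "bucket"): 0.5,
    ("kick", "the"): 0.4,
}
print(sorted(greedy_select(scores)))
# [('hot', 'dog'), ('kick', 'the', 'bucket')]
```

In this toy run, selecting "kick the bucket" suppresses the bigrams it subsumes as well as the overlapping "bucket list", whereas a plain ranking by score with a threshold of 0.6 would have kept "bucket list" too.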
