Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice

We present a new model for acquiring comprehensive multiword lexicons from large corpora based on competition among n-gram candidates. In contrast to the standard approach of simple ranking by association measure, in our model n-grams are arranged in a lattice structure based on subsumption and overlap relationships, with nodes inhibiting other nodes in their vicinity when they are selected as a lexical item. We show how the configuration of such a lattice can be optimized tractably, and demonstrate using annotations of sampled n-grams that our method consistently outperforms alternatives by at least 0.05 F-score across several corpora and languages.
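The competition idea in the abstract can be illustrated with a small toy sketch. It is not the authors' optimization procedure: the greedy pass, the function names (`subsumes`, `overlaps`, `select_lexicon`), and the candidate scores are all assumptions made for illustration. It only shows how subsumption and overlap relations between n-grams can be computed, and how a selected node might inhibit its neighbours.

```python
# Toy sketch (assumption): greedy selection with inhibition over n-gram
# candidates. The paper optimizes the lattice configuration; this greedy
# pass is a simplified stand-in, and the scores below are made up.

def subsumes(big, small):
    """True if `small` is a shorter contiguous sub-sequence of `big`."""
    if len(small) >= len(big):
        return False
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))

def overlaps(a, b):
    """True if a proper suffix of one n-gram equals a proper prefix of the other."""
    for x, y in ((a, b), (b, a)):
        for k in range(1, min(len(x), len(y))):
            if x[-k:] == y[:k]:
                return True
    return False

def related(a, b):
    """Symmetric neighbourhood relation: subsumption or overlap."""
    return subsumes(a, b) or subsumes(b, a) or overlaps(a, b)

def select_lexicon(scores):
    """Pick candidates from best to worst score; each pick inhibits its neighbours."""
    selected, inhibited = [], set()
    for cand in sorted(scores, key=lambda c: (-scores[c], c)):
        if cand in inhibited:
            continue
        selected.append(cand)
        for other in scores:
            if other != cand and related(cand, other):
                inhibited.add(other)
    return selected

# Hypothetical candidate scores (not taken from the paper).
toy_scores = {
    ("kick", "the", "bucket"): 5.0,
    ("kick", "the"): 3.0,
    ("the", "bucket"): 3.0,
    ("the", "bucket", "list"): 2.0,
    ("bucket", "list"): 4.0,
    ("by", "and", "large"): 4.5,
}

lexicon = select_lexicon(toy_scores)
# lexicon == [("kick", "the", "bucket"), ("by", "and", "large")]
```

In this toy run, selecting "kick the bucket" suppresses the bigrams it subsumes as well as the overlapping candidates "the bucket list" and "bucket list", while the unrelated "by and large" survives. A simple ranking by score alone would have kept all six candidates.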

Timothy Baldwin | Julian Brooke | Jan Šnajder
