MWEs and Topic Modelling: Enhancing Machine Learning with Linguistics

Topic modelling is a popular approach to the joint clustering of documents and terms, e.g. via Latent Dirichlet Allocation. The standard document representation in topic modelling is a bag of unigrams, which ignores both macro-level document structure and micro-level constituent structure. In this talk, I will discuss recent work on consolidating the micro-level document representation with multiword expressions, and present experimental results which demonstrate that linguistically richer document representations enhance topic modelling.
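As a rough illustration of the representational difference described above, the following sketch contrasts a plain bag of unigrams with a bag in which known multiword expressions are merged into single terms. It is not the method used in the talk: the lexicon `MWE_LEXICON`, the function names, and the greedy longest-match strategy are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical MWE lexicon, for illustration only; a real system would
# obtain MWEs from a dictionary or an extraction method.
MWE_LEXICON = {
    ("machine", "learning"),
    ("topic", "model"),
    ("latent", "dirichlet", "allocation"),
}
MAX_MWE_LEN = max(len(mwe) for mwe in MWE_LEXICON)


def bag_of_unigrams(tokens):
    """Standard representation: count each token independently."""
    return Counter(tokens)


def bag_with_mwes(tokens):
    """Merge lexicon MWEs into single terms (greedy longest match), then count."""
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to length 2.
        for n in range(min(MAX_MWE_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in MWE_LEXICON:
                terms.append("_".join(candidate))
                i += n
                break
        else:
            # No MWE starts here; keep the token as a unigram.
            terms.append(tokens[i])
            i += 1
    return Counter(terms)


tokens = "latent dirichlet allocation is a topic model from machine learning".split()
unigram_bag = bag_of_unigrams(tokens)
mwe_bag = bag_with_mwes(tokens)
```

Either bag could then be passed to a topic model as the document's term counts. In the second, "machine learning" is a single term rather than two unrelated unigrams.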
