MWEs and Topic Modelling: Enhancing Machine Learning with Linguistics

Topic modelling is a popular approach to joint clustering of documents and terms, e.g. via Latent Dirichlet Allocation. The standard document representation in topic modelling is a bag of unigrams, which ignores both macro-level document structure and micro-level constituent structure. In this talk, I will discuss recent work on consolidating the micro-level document representation with multiword expressions, and present experimental results which demonstrate that linguistically richer document representations enhance topic modelling.
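
To make the contrast between the two document representations concrete, here is a minimal sketch in Python. It is not the method presented in the talk: the MWE lexicon, the greedy longest-match strategy, and the names `MWE_LEXICON`, `merge_mwes`, and `bag_of_terms` are all assumptions introduced purely for illustration. It builds a bag of unigrams and a bag in which known multiword expressions are merged into single terms, either of which could then be passed to a topic model.

```python
from collections import Counter

# Hypothetical MWE lexicon, for illustration only; not taken from the talk.
MWE_LEXICON = {
    ("machine", "learning"),
    ("topic", "model"),
    ("latent", "dirichlet", "allocation"),
}
MAX_MWE_LEN = max(len(mwe) for mwe in MWE_LEXICON)


def merge_mwes(tokens):
    """Greedy left-to-right longest match: join lexicon MWEs into single tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to length 2.
        for n in range(min(MAX_MWE_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in MWE_LEXICON:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No MWE starts at this position: keep the unigram as-is.
            merged.append(tokens[i])
            i += 1
    return merged


def bag_of_terms(text, use_mwes):
    """Return term counts for a document, optionally merging MWEs first."""
    tokens = text.lower().split()
    if use_mwes:
        tokens = merge_mwes(tokens)
    return Counter(tokens)


doc = "latent dirichlet allocation is a topic model used in machine learning"
unigram_bag = bag_of_terms(doc, use_mwes=False)  # standard bag of unigrams
mwe_bag = bag_of_terms(doc, use_mwes=True)       # MWEs count as single terms
```

In the unigram bag, "latent", "dirichlet", and "allocation" are three unrelated terms; in the MWE-aware bag they form one term, which is the kind of micro-level consolidation the abstract describes. A real pipeline would need proper tokenisation and an MWE identification step rather than a fixed lexicon.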

Timothy Baldwin
