Statistics and linguistic rules in multiword extraction: a comparative analysis

A hybrid methodology is proposed for extracting multiword expressions using both linguistic and statistical information. In the proposed methodology, N-grams are first extracted by linguistic patterns, and various statistical measures are then applied to classify these N-grams as multiword expressions. To address the problem of choosing a cut-off threshold in the statistical filtering phase, a novel method for calculating the boundary threshold is designed. The proposed methodology is compared with a baseline method in which the order is reversed: N-grams are first filtered by statistical measures, and linguistic filtering is applied afterwards. Precision, recall and F-score are calculated on a manually annotated corpus. The observed results show that the proposed methodology performs well for certain types of multiword expressions, such as compound nouns, verb-particle constructions and verb-verb combinations.
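The pipeline order described above (linguistic filtering first, statistical scoring second, then evaluation against annotations) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the POS patterns, the use of pointwise mutual information as the association measure, and the function names. In particular, the abstract does not say how the boundary threshold is computed, so the mean score used here is only a stand-in, not the authors' method.

```python
import math
from collections import Counter

# Hypothetical POS patterns for the expression types named in the abstract
# (compound nouns, verb-particle, verb-verb). The tag names are assumptions.
PATTERNS = {("NOUN", "NOUN"), ("VERB", "PART"), ("VERB", "VERB")}


def candidate_bigrams(tagged):
    """Linguistic filter: keep adjacent word pairs whose POS tags match a pattern."""
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PATTERNS
    ]


def pmi_scores(tagged, candidates):
    """Statistical step: pointwise mutual information for each candidate bigram."""
    words = [w for w, _ in tagged]
    n = len(words)
    unigram = Counter(words)
    bigram = Counter(zip(words, words[1:]))
    scores = {}
    for w1, w2 in set(candidates):
        p_xy = bigram[(w1, w2)] / (n - 1)
        p_x = unigram[w1] / n
        p_y = unigram[w2] / n
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract(tagged):
    """Linguistic filtering first, then statistical filtering."""
    scores = pmi_scores(tagged, candidate_bigrams(tagged))
    if not scores:
        return set()
    # Placeholder threshold (mean score); NOT the paper's threshold method.
    threshold = sum(scores.values()) / len(scores)
    return {bg for bg, s in scores.items() if s >= threshold}


def precision_recall_f(predicted, gold):
    """Precision, recall and F-score of a predicted set against gold annotations."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

The baseline described in the abstract would apply the same two filters in the opposite order, scoring all N-grams statistically before checking the POS patterns.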
