Lexical association measures and collocation extraction

We present an extensive empirical evaluation of collocation extraction methods based on lexical association measures and on combinations of these measures. The experiments are performed on three sets of collocation candidates extracted from the Prague Dependency Treebank, which has manual morphosyntactic annotation, and from the Czech National Corpus, which has automatically assigned lemmas and part-of-speech tags. The collocation candidates were manually labeled as collocational or non-collocational. The evaluation measures how well each method ranks the candidates by how likely they are to form collocations. The methods are compared using precision-recall curves and mean average precision scores. The work focuses on two-word (bigram) collocations only. We experiment with bigrams extracted from sentence dependency structure as well as from surface word order. Further, we study the effect of corpus size on the performance of the individual methods and of their combination.
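The ranking-based evaluation described above can be illustrated with a minimal sketch. It assumes pointwise mutual information (PMI) as the association measure, which is only one well-known example and is not named in the abstract. The candidate bigrams, their counts, and their labels are hypothetical, and the average precision shown is the standard information-retrieval definition, which may differ in detail from the variant used in the study.

```python
import math


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information of a bigram, computed from raw counts.

    f_xy: joint frequency of the bigram, f_x / f_y: frequencies of its
    components, n: total number of bigram tokens in the (toy) corpus.
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def average_precision(ranked_labels):
    """Average of the precision values at the rank of each true collocation."""
    hits = 0
    total = 0.0
    for rank, is_collocation in enumerate(ranked_labels, start=1):
        if is_collocation:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical data: (bigram, f_xy, f_x, f_y, manually labeled as collocation?)
N = 10000
candidates = [
    ("red tape", 20, 100, 50, True),
    ("of the", 300, 2000, 3000, False),
    ("prime minister", 30, 60, 40, True),
    ("new car", 5, 10, 50, False),
]

# Rank candidates by descending association score, then score the ranking
# against the manual labels.
ranked = sorted(candidates, key=lambda c: pmi(c[1], c[2], c[3], N), reverse=True)
ranked_bigrams = [c[0] for c in ranked]
labels = [c[4] for c in ranked]
ap = average_precision(labels)

print(ranked_bigrams)
print(round(ap, 4))
```

In this toy ranking a rare non-collocational pair is placed above a true collocation, which lowers the average precision below 1. Averaging such scores over several data sets or measures gives a single figure for comparing methods.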
