Normalized (pointwise) mutual information in collocation extraction

In this paper, we discuss two related information-theoretic association measures, mutual information and pointwise mutual information, in the context of collocation extraction. We introduce normalized variants of these measures to make them easier to interpret and, at the same time, less sensitive to occurrence frequency. We also report a small empirical study that gives more insight into the behaviour of the new measures in a collocation extraction setup.
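The abstract does not spell out the formulas, so the following is only a minimal sketch under stated assumptions. It uses the commonly cited definition of normalized pointwise mutual information, in which PMI is divided by -log p(x, y), so that the score lies in [-1, 1]: 1 for words that only occur together, 0 under independence. The function names and the count-based probability estimates are illustrative choices, not taken from the paper.

```python
import math


def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information: log of p(x, y) / (p(x) * p(y))."""
    return math.log(p_xy / (p_x * p_y))


def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """PMI divided by -log p(x, y) (assumed normalization, see lead-in)."""
    if not 0.0 < p_xy < 1.0:
        # The normalizer -log p(x, y) is zero or undefined outside (0, 1).
        raise ValueError("p_xy must lie strictly between 0 and 1")
    return pmi(p_xy, p_x, p_y) / -math.log(p_xy)


def npmi_from_counts(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Estimate NPMI from raw counts using relative frequencies.

    c_xy: co-occurrence count of the pair, c_x and c_y: counts of the
    individual words, n: total number of observed pairs (hypothetical setup).
    """
    return npmi(c_xy / n, c_x / n, c_y / n)


if __name__ == "__main__":
    # Two words that only ever occur together score close to 1.
    print(npmi_from_counts(c_xy=50, c_x=50, c_y=50, n=10_000))
    # Two words that co-occur exactly as often as chance predicts score close to 0.
    print(npmi_from_counts(c_xy=20, c_x=1_000, c_y=200, n=10_000))
```

Because the normalizer depends only on the joint probability, a rare pair and a frequent pair that each co-occur perfectly receive the same score, which is one way to read the abstract's claim of reduced sensitivity to occurrence frequency.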
