A corpus-based approach to comparative evaluation of statistical term association measures

Statistical association measures have been widely applied in information retrieval research, usually to cluster documents or terms on the basis of their relationships. Applications of association measures to term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationships and behavior they demonstrate in various analyses of a test corpus. The analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-scores. The major findings of the study are as follows. First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas the cosine and Jaccard coefficients, as well as the χ² statistic and the likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the χ² statistic is the least affected by the frequency of terms. Third, although the cosine and Jaccard coefficients tend to emphasize high-frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
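
As a rough illustration of the six measures named above, the following sketch computes each of them from a 2x2 document co-occurrence table for a term pair. It uses common textbook definitions, which are an assumption here and may differ from the exact variants used in the study. The function name `association_measures` and the cell names `a`, `b`, `c`, `d` are hypothetical, and mutual information is taken in its pointwise form.

```python
import math


def association_measures(a, b, c, d):
    """Textbook association measures for a term pair (x, y).

    Assumed cell layout (document counts):
      a = documents containing both x and y
      b = documents containing x but not y
      c = documents containing y but not x
      d = documents containing neither term
    Assumes a > 0 and that every row and column total is non-zero.
    """
    n = a + b + c + d
    fx, fy = a + b, a + c  # document frequencies of x and y

    # Pointwise mutual information, in bits.
    mutual_information = math.log2(a * n / (fx * fy))

    # Yule's coefficient of colligation Y.
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    yule_y = (root_ad - root_bc) / (root_ad + root_bc)

    # Cosine and Jaccard coefficients on the two document sets.
    cosine = a / math.sqrt(fx * fy)
    jaccard = a / (a + b + c)

    # Pearson chi-square statistic for the 2x2 table.
    chi_square = n * (a * d - b * c) ** 2 / (fx * (n - fx) * fy * (n - fy))

    # Log-likelihood ratio: G^2 = 2 * sum(O * ln(O / E)); empty cells add 0.
    observed = [a, b, c, d]
    expected = [
        fx * fy / n,
        fx * (n - fy) / n,
        (n - fx) * fy / n,
        (n - fx) * (n - fy) / n,
    ]
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

    return {
        "mutual_information": mutual_information,
        "yule_y": yule_y,
        "cosine": cosine,
        "jaccard": jaccard,
        "chi_square": chi_square,
        "log_likelihood": log_likelihood,
    }


if __name__ == "__main__":
    # Illustrative counts only, not taken from the study's corpus.
    for name, value in association_measures(10, 10, 10, 70).items():
        print(f"{name}: {value:.4f}")
```

Under these definitions, the cosine and Jaccard coefficients depend only on `a`, `b`, and `c`, whereas Yule's Y and the χ² statistic also use `d`, and mutual information uses the collection size `n`. Comparing the values across term pairs of different frequency is one simple way to explore the frequency effects the abstract describes.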
