Using chi-square statistics to measure similarities for text categorization

In this paper, we propose using chi-square statistics to measure similarities, and chi-square tests to determine the homogeneity of two random samples of term vectors, for text categorization. We first study the properties of chi-square tests for text categorization. One advantage of the chi-square test is that its significance level is close to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. A classifier using cosine similarities with TF*IDF generally performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage.
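As a rough illustration of the two measures discussed above, the following minimal sketch computes a Pearson chi-square homogeneity statistic and a cosine similarity for two term-count vectors. It is not the implementation used in the paper: the function names, the use of raw term counts instead of TF*IDF weights, and the `same_category` rule that combines the two measures are all assumptions made for illustration, and the thresholds are left to the caller.

```python
import math
from collections import Counter


def chi_square_statistic(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x k table of term counts.

    Rows are the two samples, columns are the terms seen in either one.
    Returns (statistic, degrees_of_freedom).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain at least one term")
    grand_total = total_a + total_b

    # Keep only terms that occur at least once, so expected counts are > 0.
    terms = [t for t in set(counts_a) | set(counts_b)
             if counts_a.get(t, 0) + counts_b.get(t, 0) > 0]

    statistic = 0.0
    for term in terms:
        column_total = counts_a.get(term, 0) + counts_b.get(term, 0)
        for observed, row_total in ((counts_a.get(term, 0), total_a),
                                    (counts_b.get(term, 0), total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    # (rows - 1) * (columns - 1) with two rows.
    return statistic, len(terms) - 1


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two sparse term-count vectors (raw counts)."""
    dot = sum(v * counts_b.get(t, 0) for t, v in counts_a.items())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_category(counts_a, counts_b, cosine_threshold, chi2_critical):
    """Hypothetical combined rule (an assumption, not the paper's method):
    accept only if the cosine similarity is high enough AND the chi-square
    test does not reject homogeneity at the caller's critical value."""
    statistic, _ = chi_square_statistic(counts_a, counts_b)
    return (cosine_similarity(counts_a, counts_b) >= cosine_threshold
            and statistic <= chi2_critical)


doc = Counter({"market": 10, "stock": 10})
category = Counter({"market": 20, "stock": 20})
statistic, dof = chi_square_statistic(doc, category)
print(statistic, dof, cosine_similarity(doc, category))
```

A smaller statistic means the two samples look more alike, so it is compared against a critical value of the chi-square distribution with the returned degrees of freedom (for example 3.841 for one degree of freedom at the 0.05 significance level). Cosine similarity is compared against a lower bound instead.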
