Paper information - Using chi-square statistics to measure similarities for text categorization

In this paper, we propose using chi-square statistics to measure similarities, and chi-square tests to determine the homogeneity of two random samples of term vectors, for text categorization. We first study the properties of chi-square tests for text categorization. One advantage of the chi-square test is that its significance level approximates the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. A classifier using cosine similarities with TF*IDF generally performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To overcome this limitation, we propose combining chi-square statistics with cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage.
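The two measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard Pearson chi-square test of homogeneity on a 2 x k table of raw term counts; the abstract does not give the paper's exact formulation, weighting, or combination rule, so the function names `chi_square_homogeneity` and `cosine_similarity` are hypothetical, and raw counts stand in for the TF*IDF weights mentioned above.

```python
import math


def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for two samples of term counts.

    counts_a and counts_b map term -> count. Returns (statistic, dof).
    Assumption: textbook 2 x k homogeneity test, not necessarily the
    exact statistic used in the paper.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples must contain at least one term")
    grand_total = total_a + total_b

    statistic = 0.0
    used_columns = 0
    for term in set(counts_a) | set(counts_b):
        obs_a = counts_a.get(term, 0)
        obs_b = counts_b.get(term, 0)
        column_total = obs_a + obs_b
        if column_total == 0:
            continue  # a term absent from both samples carries no information
        used_columns += 1
        for observed, row_total in ((obs_a, total_a), (obs_b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    # (rows - 1) * (columns - 1) with two rows
    return statistic, used_columns - 1


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two sparse term vectors (raw counts here)."""
    dot = sum(v * counts_b.get(t, 0) for t, v in counts_a.items())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc = {"chi": 3, "square": 3, "test": 2}
    category = {"chi": 30, "square": 28, "test": 25, "cosine": 4}
    stat, dof = chi_square_homogeneity(doc, category)
    print(f"chi-square = {stat:.3f} with {dof} degrees of freedom")
    print(f"cosine     = {cosine_similarity(doc, category):.3f}")
```

A small statistic relative to the chi-square critical value for the given degrees of freedom is consistent with the two samples being homogeneous, while a high cosine value indicates similar direction of the term vectors. How the two scores are thresholded and combined is left open here because the abstract does not specify it.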

Meng Chang Chen | Yao-Tsung Chen
