Paper Information - Word-Text Matrix Feature Selection in Chinese Text Classification Based on LSI

Latent semantic indexing (LSI) can be regarded as a mapping of the vector space model (VSM). Applying singular value decomposition to the word-text matrix of the original text collection reveals the relationships among the latent concepts in the document set. Representing the whole concept space by a set of latent concepts reduces the ambiguity of concept expression and avoids the VSM assumption that the concepts along different dimensions are orthogonal. This paper studies how four feature selection methods (Information Gain, Cross Entropy, Odds Ratio, and Union Odds Ratio), each used to reduce the dimensionality of the word-document matrix, affect LSI-based classification of Chinese text. The experimental results show that reducing the dimensionality of the word-text matrix with Union Odds Ratio yields better classification than the other three methods in LSI-based text classification.
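
The pipeline described in the abstract (score terms, keep the best ones, then apply SVD to the reduced word-text matrix) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the plain log-odds form of Odds Ratio rather than Union Odds Ratio (whose definition is not given here), ranks terms by absolute score, smooths probabilities by clipping with a hypothetical `eps`, and runs on a made-up toy matrix. The function names `odds_ratio` and `lsi_project` are illustrative only.

```python
import numpy as np

def odds_ratio(X, y, eps=1e-6):
    """Log odds-ratio score per term for a two-class labelling.

    X: (n_terms, n_docs) boolean term-occurrence matrix
    y: (n_docs,) boolean array, True for the positive class
    eps: clipping constant (an assumption) that avoids division by zero
    """
    p_pos = np.clip(X[:, y].mean(axis=1), eps, 1 - eps)   # P(term | positive)
    p_neg = np.clip(X[:, ~y].mean(axis=1), eps, 1 - eps)  # P(term | negative)
    return np.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))

def lsi_project(A, k):
    """Rank-k LSI: document coordinates in the k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (s[:k, None] * Vt[:k]).T   # shape (n_docs, k)

# Toy word-text matrix: rows are terms, columns are documents (made-up counts).
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 1, 1, 1],   # occurs in every document, so it carries no class signal
], dtype=float)
y = np.array([True, True, False, False])

scores = odds_ratio(A > 0, y)
keep = np.argsort(-np.abs(scores))[:4]   # keep the 4 most discriminative terms
docs_k = lsi_project(A[keep], k=2)       # reduced matrix mapped to 2 latent concepts
```

In this toy case the term that appears in every document scores zero and is dropped, and the two documents of each class land on the same point in the latent space. A classifier would then be trained on the rows of `docs_k`.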

Rong Wang | Jianhua Wang | Yijun Gu
