Word-Text Matrix Feature Selection in Chinese Text Classification Based on LSI

Latent semantic indexing (LSI) can be regarded as a mapping of the vector space model (VSM). By applying singular value decomposition to the word-text matrix of the original text set, LSI extracts the relationships among the latent concepts in the document set. Representing the whole concept space by this set of latent concepts reduces the ambiguity of concept expression and avoids the VSM assumption that the concepts along different dimensions are orthogonal. This paper studies how four feature selection methods (Information Gain, Cross Entropy, Odds Ratio, and Union Odds Ratio), each used to reduce the dimensionality of the word-text matrix, affect LSI-based classification of Chinese text. The experimental results show that reducing the dimensionality of the word-text matrix with Union Odds Ratio yields better classification than the other three methods in LSI-based text classification.
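As a rough illustration of the feature selection step that precedes the singular value decomposition, the sketch below scores the rows of a toy word-text matrix with a commonly used log odds ratio and keeps the top-scoring terms. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names, the smoothing constant `eps`, the document-frequency probability estimates, and the toy data are all hypothetical, and the paper's exact Odds Ratio and Union Odds Ratio definitions are not given here, so Union Odds Ratio is not shown.

```python
import math


def odds_ratio_scores(matrix, labels, positive, eps=0.5):
    """Score each term (row) of a word-text count matrix for one category.

    matrix[i][j] is the count of term i in document j, and labels[j] is the
    category of document j. eps is an assumed additive smoothing constant
    that keeps the probabilities away from 0 and 1.
    """
    pos_docs = [j for j, c in enumerate(labels) if c == positive]
    neg_docs = [j for j, c in enumerate(labels) if c != positive]
    scores = []
    for row in matrix:
        # Smoothed document-frequency estimates of P(term | positive)
        # and P(term | negative).
        p_pos = (sum(1 for j in pos_docs if row[j] > 0) + eps) / (len(pos_docs) + 2 * eps)
        p_neg = (sum(1 for j in neg_docs if row[j] > 0) + eps) / (len(neg_docs) + 2 * eps)
        # Log odds ratio: large when the term favors the positive category.
        scores.append(math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg)))
    return scores


def select_terms(matrix, labels, positive, k):
    """Return the indices of the k highest-scoring terms, in row order."""
    scores = odds_ratio_scores(matrix, labels, positive)
    top = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy word-text matrix: 4 terms (rows) x 4 documents (columns).
matrix = [
    [2, 1, 0, 0],  # term 0: appears only in "sports" documents
    [0, 0, 3, 1],  # term 1: appears only in "finance" documents
    [1, 1, 1, 1],  # term 2: appears in every document
    [1, 0, 0, 0],  # term 3: appears in one "sports" document
]
labels = ["sports", "sports", "finance", "finance"]

kept = select_terms(matrix, labels, "sports", k=2)
# Reduced word-text matrix; in an LSI pipeline this smaller matrix is what
# would be passed on to the singular value decomposition.
reduced = [matrix[i] for i in kept]
```

In this toy case the term that occurs in every document scores zero and is dropped, which is the intended effect of the selection step: fewer rows enter the decomposition, and the rows that remain are the ones that discriminate between categories.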