Text Classification Using Small Number of Features

We propose a feature selection method for text classification that ranks terms by information gain and then removes redundant terms using a mutual information measure and an inclusion index. We report an experiment that studies the impact of term redundancy on the performance of a text classifier. The results show that term redundancy behaves much like noise and may degrade classifier performance. The proposed method is tested on an SVM text classifier. Feature reduction by this method remarkably outperforms feature selection based on information gain alone.
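
To make the general idea concrete, the following is a minimal sketch of information gain ranking followed by a redundancy filter based on mutual information between terms. It is not the method as specified in the paper: the abstract gives neither the exact redundancy criterion nor the definition of the inclusion index, so the inclusion index is omitted here, and the greedy loop, the normalization by the smaller term entropy, and the `redundancy_threshold` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    hx = entropy(Counter(xs).values())
    hy = entropy(Counter(ys).values())
    hxy = entropy(Counter(zip(xs, ys)).values())
    return hx + hy - hxy


def select_features(docs, labels, k, redundancy_threshold=0.5):
    """Rank terms by information gain, then greedily skip redundant ones.

    docs: list of token sets; labels: list of class labels.
    ASSUMPTION: a term is "redundant" if its mutual information with an
    already selected term, divided by the smaller of the two term
    entropies, reaches redundancy_threshold. This rule is hypothetical.
    """
    vocab = sorted({t for d in docs for t in d})
    presence = {t: [int(t in d) for d in docs] for t in vocab}

    # For a binary presence feature, information gain with respect to the
    # class equals the mutual information between the feature and the class.
    gain = {t: mutual_information(presence[t], labels) for t in vocab}
    ranked = sorted(vocab, key=lambda t: gain[t], reverse=True)

    def normalized_mi(a, b):
        h = min(entropy(Counter(presence[a]).values()),
                entropy(Counter(presence[b]).values()))
        return mutual_information(presence[a], presence[b]) / h if h > 0 else 0.0

    selected = []
    for t in ranked:
        if len(selected) == k:
            break
        if all(normalized_mi(t, s) < redundancy_threshold for s in selected):
            selected.append(t)
    return selected, gain


# Toy corpus (made up for illustration only).
docs = [
    {"goal", "match", "football"},
    {"goal", "match", "team"},
    {"vote", "election", "party"},
    {"vote", "election", "team"},
]
labels = ["sport", "sport", "politics", "politics"]
selected, gain = select_features(docs, labels, k=3)
print(selected)
```

In this toy corpus "goal" and "match" occur in exactly the same documents, so the filter keeps at most one of them even though both have maximal information gain. A plain top-k ranking by information gain would keep both and spend part of the feature budget on duplicated information.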
