SVM-based feature selection of latent semantic features

Latent Semantic Indexing (LSI) is an effective method for extracting features that capture the underlying latent semantic structure of word usage across documents. However, the subspace selected by this method may not be the most appropriate one for classifying documents, since LSI orders the extracted features by their variance, not by their classification power. We propose applying a feature ordering method based on support vector machines (SVMs) to select the LSI features that are best suited for classification. Experimental results suggest that the method improves classification performance while using a considerably more compact representation.
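The idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the ordering criterion, so the code assumes that LSI features are ranked by the absolute weight a linear SVM assigns to them. The helper names (`lsi_project`, `fit_linear_svm`), the toy document-term matrix, and the plain subgradient-descent trainer without a bias term are all hypothetical choices made for illustration.

```python
import numpy as np

def lsi_project(X, k):
    """Project documents (rows of X) onto the top-k LSI dimensions.

    The columns of the result are ordered by singular value, i.e. by
    how much variation they capture, not by how well they classify.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def fit_linear_svm(Z, y, lam=0.01, eta=0.05, steps=2000):
    """Linear SVM without bias, trained by full-batch subgradient descent
    on the L2-regularised hinge loss (an assumed, simplified trainer)."""
    n, d = Z.shape
    w = np.zeros(d)
    for _ in range(steps):
        active = y * (Z @ w) < 1.0  # documents violating the margin
        hinge_grad = (y[active][:, None] * Z[active]).sum(axis=0) / n
        w -= eta * (lam * w - hinge_grad)
    return w

# Toy document-term matrix (assumed data): terms 0 and 1 vary strongly
# but are unrelated to the class; terms 2 and 3 determine the class.
X = np.array([[4, 4, 1, 0],
              [0, 0, 1, 0],
              [4, 4, 0, 1],
              [0, 0, 0, 1]] * 2, dtype=float)
y = np.array([1, 1, -1, -1] * 2, dtype=float)

Z = lsi_project(X, k=3)              # LSI order: component 0, 1, 2
w = fit_linear_svm(Z, y)
svm_order = np.argsort(-np.abs(w))   # components ranked by |SVM weight|
print("SVM-based order of LSI components:", svm_order)
```

On this toy data the highest-variance LSI component (index 0) carries no class information, so the SVM-based ordering should place the second component (index 1) first. Keeping only the top-ranked components then gives a smaller representation than truncating in LSI's own variance order.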
