GS-Orthogonalization Based "Basis Feature" Selection from Word Co-occurrence Matrix

Feature selection plays an important role in machine learning applications. For text data in particular, high dimensionality and sparsity degrade the performance of feature selection. In this paper, we propose an unsupervised feature selection algorithm that applies Random Projection and Gram-Schmidt Orthogonalization (RP-GSO) to the word co-occurrence matrix. RP-GSO has three advantages: (1) it takes a dense word co-occurrence matrix as input, avoiding the sparseness of the original document-term matrix; (2) it selects "basis features" by the Gram-Schmidt process, guaranteeing the orthogonality of the feature space; and (3) it adopts random projection to speed up the Gram-Schmidt process. We conducted extensive experiments on two real-world text corpora and observed that RP-GSO achieves better performance than both supervised and unsupervised methods in text classification and clustering tasks.
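The three steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `select_basis_features`, the Gaussian projection matrix, the parameters `proj_dim` and `seed`, and the greedy rule of picking the word with the largest residual norm at each step are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def select_basis_features(X, n_features, proj_dim=64, seed=0):
    """Hypothetical sketch of RP-GSO-style selection (assumptions noted inline).

    X is a (documents x terms) count matrix; returns indices of selected terms.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Step 1: dense word co-occurrence matrix (terms x terms).
    C = X.T @ X

    # Step 2: random projection of each word's co-occurrence row to proj_dim
    # dimensions. A Gaussian matrix is an assumption; other choices exist.
    R = rng.standard_normal((C.shape[1], proj_dim)) / np.sqrt(proj_dim)
    residual = C @ R

    # Step 3: greedy Gram-Schmidt. Assumed rule: pick the word whose residual
    # has the largest norm, then remove that direction from every row.
    selected = []
    for _ in range(min(n_features, proj_dim)):
        norms = np.einsum("ij,ij->i", residual, residual)  # squared row norms
        if selected:
            norms[selected] = -1.0  # never pick the same word twice
        j = int(np.argmax(norms))
        if norms[j] <= 1e-12:
            break  # remaining rows lie in the span of the selected ones
        q = residual[j] / np.sqrt(norms[j])
        residual -= np.outer(residual @ q, q)
        selected.append(j)
    return selected
```

Called on a toy count matrix, for example `select_basis_features(np.random.default_rng(1).integers(0, 5, size=(20, 10)), n_features=3, proj_dim=8)`, the sketch returns three distinct column indices. Because the projected rows have only `proj_dim` dimensions, at most `proj_dim` words can be selected in this version.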
