Regularized locality preserving indexing via spectral regression

We consider the problem of document indexing and representation. Recently, Locality Preserving Indexing (LPI) was proposed for learning a compact document subspace. Unlike Latent Semantic Indexing (LSI), which is optimal in the sense of global Euclidean structure, LPI is optimal in the sense of local manifold structure. However, LPI is inefficient in both time and memory, which makes it difficult to apply to very large data sets. Specifically, computing LPI involves eigen-decompositions of two dense matrices, which is expensive. In this paper, we propose a new algorithm called Regularized Locality Preserving Indexing (RLPI). Building on recent progress in spectral graph analysis, we cast the original LPI algorithm into a regression framework, which enables us to avoid eigen-decomposition of dense matrices. Moreover, with the regression-based framework, different kinds of regularizers can be naturally incorporated into our algorithm, which makes it more flexible. Extensive experimental results show that RLPI obtains results similar to or better than those of LPI while being significantly faster, which makes it an efficient and effective data preprocessing method for large-scale text clustering, classification, and retrieval.
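
The regression formulation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name rlpi_sketch, the binary k-nearest-neighbor graph, the ridge (L2) regularizer, and the parameters k and alpha are all assumptions chosen for illustration, and the toy code uses small dense NumPy arrays where a large-scale version would presumably use sparse matrices and an iterative solver. The idea shown is two steps: first solve the graph eigenproblem W y = lambda D y on the document affinity graph, then fit each eigenvector y by regularized least squares on the term features.

```python
import numpy as np

def rlpi_sketch(X, n_components=2, k=3, alpha=0.1):
    """Illustrative two-step spectral regression (hypothetical helper).

    X is an (n_docs, n_terms) array whose rows are L2-normalized,
    so X @ X.T holds cosine similarities between documents.
    Returns a projection matrix A of shape (n_terms, n_components).
    """
    n_docs, n_terms = X.shape

    # Step 0: binary k-nearest-neighbor affinity graph (an assumption).
    S = X @ X.T
    np.fill_diagonal(S, -np.inf)               # exclude self-matches
    nbrs = np.argsort(-S, axis=1)[:, :k]       # k most similar docs per row
    W = np.zeros((n_docs, n_docs))
    rows = np.repeat(np.arange(n_docs), k)
    W[rows, nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)                     # symmetrize

    # Step 1: solve W y = lambda D y through the symmetric form
    # D^{-1/2} W D^{-1/2} u = lambda u, with y = D^{-1/2} u.
    d = W.sum(axis=1)                          # every degree is >= k > 0
    d_inv_sqrt = 1.0 / np.sqrt(d)
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, U = np.linalg.eigh(M)                   # eigenvalues in ascending order
    U = U[:, ::-1]                             # largest eigenvalue first
    # Skip the trivial top eigenvector (assumes a connected graph).
    Y = d_inv_sqrt[:, None] * U[:, 1:n_components + 1]

    # Step 2: ridge regression, one column of A per eigenvector:
    # minimize ||X a - y||^2 + alpha * ||a||^2.
    G = X.T @ X + alpha * np.eye(n_terms)
    A = np.linalg.solve(G, X.T @ Y)
    return A

# Toy usage on random non-negative "term frequency" data.
rng = np.random.default_rng(0)
X = rng.random((20, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
A = rlpi_sketch(X, n_components=2, k=3, alpha=0.1)
Z = X @ A                                      # low-dimensional document embedding
```

In this sketch the only eigenproblem involves the graph matrix, which is sparse for a k-nearest-neighbor graph, and the dense term-by-term matrices enter only through a linear system. Swapping the ridge penalty for another regularizer would change only Step 2.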
