Orthogonal locality preserving indexing

We consider the problem of document indexing and representation. Recently, Locality Preserving Indexing (LPI) was proposed for learning a compact document subspace. Unlike Latent Semantic Indexing (LSI), which is optimal in the sense of global Euclidean structure, LPI is optimal in the sense of local manifold structure. However, LPI is extremely sensitive to the number of dimensions. This makes it difficult to estimate the intrinsic dimensionality, and an inaccurate estimate drastically degrades its performance. One reason for this problem is that LPI is non-orthogonal, and non-orthogonality distorts the metric structure of the document space. In this paper, we propose a new algorithm called Orthogonal LPI (OLPI). OLPI iteratively computes mutually orthogonal basis functions that respect the local geometrical structure. Moreover, our empirical study shows that OLPI can have more locality preserving power than LPI. We compare the new algorithm to LSI and LPI. Extensive experimental results show that OLPI obtains better performance than both LSI and LPI. More importantly, it is insensitive to the number of dimensions, which makes it an efficient data preprocessing method for text clustering, classification, retrieval, and similar tasks.
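The abstract does not spell out the iterative procedure, so the following is only a minimal sketch of one way to obtain mutually orthogonal, locality preserving basis vectors. It is not claimed to be the exact OLPI algorithm. It rests on several assumptions: documents are the columns of a matrix `X`, the neighbourhood graph is a symmetric 0/1 k-nearest-neighbour graph, a small ridge term keeps the constraint matrix positive definite, and each new vector is found by solving the LPI-style generalized eigenproblem restricted to the orthogonal complement of the vectors already chosen. The function names (`knn_affinity`, `smallest_gen_eigvec`, `orthogonal_lpi_sketch`) and the parameters `k` and `ridge` are hypothetical.

```python
import numpy as np


def knn_affinity(X, k=3):
    """Symmetric 0/1 k-nearest-neighbour affinity (assumed graph construction).

    X holds one document per column.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # squared distances
    np.fill_diagonal(d2, np.inf)                      # exclude self-matches
    idx = np.argsort(d2, axis=1)[:, :k]               # k nearest per document
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                         # symmetrize


def smallest_gen_eigvec(A, B):
    """Eigenvector of A v = lambda B v with the smallest lambda.

    A is symmetric and B is symmetric positive definite.
    """
    C = np.linalg.cholesky(B)            # B = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ A @ C_inv.T
    M = (M + M.T) / 2.0                  # remove round-off asymmetry
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return C_inv.T @ vecs[:, 0]


def orthogonal_lpi_sketch(X, n_components, k=3, ridge=1e-6):
    """Greedy orthogonal basis that keeps neighbouring documents close."""
    m = X.shape[0]
    W = knn_affinity(X, k)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    S_L = X @ L @ X.T
    S_D = X @ D @ X.T + ridge * np.eye(m)       # ridge is an assumption
    basis = np.zeros((m, 0))
    for _ in range(n_components):
        r = basis.shape[1]
        if r == 0:
            P = np.eye(m)
        else:
            # Trailing columns of the complete QR factor span the
            # orthogonal complement of the vectors found so far.
            Q, _ = np.linalg.qr(basis, mode="complete")
            P = Q[:, r:]
        b = smallest_gen_eigvec(P.T @ S_L @ P, P.T @ S_D @ P)
        a = P @ b
        a /= np.linalg.norm(a)                  # unit length (a sketch choice)
        basis = np.column_stack([basis, a])
    return basis


# Usage on synthetic data: 5 term features, 40 documents.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5, 40))
A_demo = orthogonal_lpi_sketch(X_demo, n_components=3)
Y_demo = A_demo.T @ X_demo   # 3-dimensional document representation
```

Because every new vector is drawn from the orthogonal complement of the earlier ones, `A_demo.T @ A_demo` is the identity up to round-off, so projecting with `A_demo` does not rescale directions relative to one another. This is the property the abstract contrasts with the non-orthogonal basis of plain LPI.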
