Document clustering using locality preserving indexing

We propose a novel document clustering method that groups documents into distinct semantic classes. The document space is generally high-dimensional, and clustering directly in such a space is often infeasible due to the curse of dimensionality. Using locality preserving indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which documents related to the same semantics lie close to each other. Unlike previous document clustering methods based on latent semantic indexing (LSI) or nonnegative matrix factorization (NMF), our method tries to discover both the geometric and the discriminating structures of the document space. Theoretical analysis shows that LPI is an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which provides the intuitive motivation for our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.
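To make the projection step concrete, the following is a minimal sketch of an LPI-style embedding. It is not the authors' implementation. It assumes nonnegative document vectors (for example TF-IDF rows), a cosine-weighted k-nearest-neighbor graph, an SVD preprocessing step, and the usual locality preserving formulation as a generalized eigenproblem of the form Z^T L Z a = lambda Z^T D Z a, where L = D - W is the graph Laplacian. The function name `lpi_embed`, its parameters, and the small ridge term `reg` are all hypothetical choices made for illustration.

```python
import numpy as np

def lpi_embed(X, n_components=2, n_neighbors=5, reg=1e-6):
    """Sketch of an LPI-style embedding (illustrative, not the paper's code).

    X: (n_docs, n_terms) nonnegative document vectors, one row per document.
    Requires n_neighbors < n_docs.
    Returns an (n_docs, n_components) array of embedded documents.
    """
    # Unit-normalize rows so that dot products are cosine similarities.
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Xn = X / norms
    S = Xn @ Xn.T
    n = S.shape[0]
    np.fill_diagonal(S, -np.inf)  # a document is not its own neighbor

    # Cosine-weighted k-nearest-neighbor graph, symmetrized.
    idx = np.argsort(-S, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    W = np.zeros((n, n))
    W[rows, idx.ravel()] = S[rows, idx.ravel()]
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=1))  # degree matrix
    L = D - W                   # graph Laplacian

    # SVD step: work in the span of the data so the constraint matrix
    # Z^T D Z is well conditioned (assumed preprocessing).
    U, s, _ = np.linalg.svd(Xn, full_matrices=False)
    r = int((s > 1e-10 * s[0]).sum())
    Z = U[:, :r] * s[:r]

    A = Z.T @ L @ Z
    B = Z.T @ D @ Z + reg * np.eye(r)  # small ridge for numerical stability

    # Solve A a = lambda B a by whitening with the Cholesky factor of B.
    C = np.linalg.cholesky(B)
    Ci = np.linalg.inv(C)
    M = Ci @ A @ Ci.T
    M = (M + M.T) / 2.0
    _, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order

    # Directions with the smallest eigenvalues best preserve locality.
    P = Ci.T @ vecs[:, :n_components]
    return Z @ P

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy corpus: two topics with disjoint vocabularies.
    X = np.zeros((20, 30))
    X[:10, :15] = rng.random((10, 15))
    X[10:, 15:] = rng.random((10, 15))
    Y = lpi_embed(X, n_components=2, n_neighbors=5)
    print(Y.round(3))
```

On the toy corpus, documents from the same topic land at essentially the same point in the embedded space, so a standard clustering algorithm (k-means is one option, assumed here) could then be run on the rows of the returned array rather than on the original high-dimensional vectors.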
