Paper Information - Document Clustering in Reduced Dimension Vector Space

Document Clustering in Reduced Dimension Vector Space

Document clustering is a popular tool for automatically organizing a large collection of texts. Clustering algorithms are usually applied to documents represented as vectors in a high-dimensional term space. We investigate the use of Latent Semantic Analysis (LSA) to create a new vector space that is the optimal representation of the document collection. Documents are projected onto a small subspace of this vector space and clustered. We compare the performance of clustering algorithms when applied to documents represented in the full term space and in a reduced-dimension subspace of the LSA-generated vector space. We report significant improvements in cluster quality for LSA subspaces with optimal dimensionality. We discuss the procedure for determining the right number of dimensions for the subspace. Moreover, when this number is small, the total running time of the clustering algorithm is comparable to that of clustering in the full term space.
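The pipeline described in the abstract (build an LSA space, project documents onto a few dimensions, then cluster) can be illustrated with a minimal sketch. It assumes NumPy, a toy term-document matrix, and a plain k-means with farthest-point initialisation. The names `lsa_project` and `kmeans`, the toy data, and the choice of k-means are assumptions made for illustration; they are not the paper's implementation or the specific clustering algorithms it evaluates.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Project documents (columns of term_doc) onto the top-k LSA dimensions."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Coordinates of each document in the k-dimensional subspace, one row per document.
    return (s[:k, None] * Vt[:k]).T

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per row of points."""
    # Farthest-point initialisation keeps this sketch deterministic.
    centers = [points[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its points.
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Hypothetical toy data: 4 terms (rows) x 4 documents (columns).
# Documents 0 and 1 share terms 0-1; documents 2 and 3 share terms 2-3.
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

docs_2d = lsa_project(term_doc, k=2)   # reduced-dimension representation
labels = kmeans(docs_2d, n_clusters=2)
```

In this toy case the two topical groups fall into separate clusters after projection to two dimensions. How many dimensions to keep for a real collection is the question the paper itself addresses; the value `k=2` here is chosen only to fit the toy data.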

Kristina Lerman
