Document Clustering in Reduced Dimension Vector Space

Document clustering is a popular tool for automatically organizing a large collection of texts. Clustering algorithms are usually applied to documents represented as vectors in a high dimensional term space. We investigate the use of Latent Semantic Analysis to create a new vector space, that is the optimal representation of the document collection. Documents are projected onto a small subspace of this vector space and clustered. We compare the performance of clustering algorithms when applied to documents represented in the full term space and in reduced dimension subspace of the LSA-generated vector space. We report significant improvements in cluster quality for LSA subspaces with optimal dimensionality. We discuss the procedure for determining the right number of dimensions for the subspace. Moreover, when this number is small, the total running time of the clustering algorithm is comparable to the one that uses the full term space.