Unsupervised document clustering using multi-resolution latent semantic density analysis

To find meaningful groupings in a given document collection, it is essential to learn the right granularity for the domain, uncover core themes and attendant outliers, and derive suitable labels with which to characterize each of the resulting clusters. The outcome is therefore affected both by the choice of representation and by the behavior of the clustering algorithm. This paper advocates a strategy which combines density-based clustering with latent semantic feature extraction. Documents are first mapped into a latent semantic vector space, and then clustered in that space on the basis of a multi-resolution density measure. Empirical evidence gathered on several document collections suggests that this procedure is effective in identifying semantically sound document clusters.
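The two-stage procedure described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define the multi-resolution density measure, so the sketch assumes a simple stand-in in which density is the number of unassigned neighbors within a cosine-distance radius, and the radius is swept from tight to loose. The function names (`latent_map`, `multires_density_clusters`) and the parameters `k`, `radii`, and `min_pts` are hypothetical, chosen for illustration only.

```python
import numpy as np


def latent_map(counts, k):
    """Map documents (columns of a word-by-document count matrix)
    to k-dimensional latent vectors via a truncated SVD."""
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    return vt[:k].T * s[:k]  # one row per document


def cosine_distances(x):
    """Pairwise cosine distances between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)
    return 1.0 - unit @ unit.T


def multires_density_clusters(x, radii, min_pts):
    """Assumed stand-in for a multi-resolution density measure:
    sweep radii from tight to loose; at each radius, repeatedly seed a
    cluster at the densest unassigned point while its neighborhood holds
    at least min_pts unassigned points. Points never absorbed keep the
    label -1 and can be read as outliers."""
    d = cosine_distances(x)
    labels = np.full(len(x), -1)
    next_label = 0
    for r in sorted(radii):
        while True:
            free = labels == -1
            if not free.any():
                break
            # Density: unassigned neighbors within radius r (self included).
            density = ((d <= r) & free[None, :]).sum(axis=1)
            density[~free] = 0
            seed = int(density.argmax())
            if density[seed] < min_pts:
                break  # nothing dense enough at this resolution
            labels[(d[seed] <= r) & free] = next_label
            next_label += 1
    return labels


# Toy word-by-document counts: documents 0-2 and 3-5 use disjoint vocabularies.
counts = np.array([
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 3],
    [0, 0, 0, 2, 3, 2],
    [0, 0, 0, 2, 2, 3],
], dtype=float)

labels = multires_density_clusters(latent_map(counts, k=2),
                                   radii=[0.1, 0.3], min_pts=2)
print(labels)
```

On this toy matrix the two vocabulary-disjoint groups of documents fall into separate clusters. Sweeping the radius from tight to loose means compact themes are claimed first and looser groupings only afterwards, which is one plausible reading of "multi-resolution"; the paper's actual measure may differ.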
