An Efficient Ranking-Centered Density-Based Document Clustering Method

Document clustering is a popular method for discovering useful information in text data. This paper proposes a hybrid document clustering method based on the concepts of ranking, density and shared neighborhood. We use the ranked documents returned by a search engine to build a graph of shared relevant documents. The high-density regions of the graph are processed to form initial clusters, and the clustering decisions are then refined using shared neighborhood information. Empirical analysis shows that the proposed method produces accurate and efficient solutions compared with relevant benchmark methods.
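
The pipeline outlined in the abstract (ranked results, a graph of shared relevant documents, dense regions as initial clusters, refinement by shared neighborhood) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the input format (a dictionary mapping each document to the documents a search engine returned for it), the function names, and the thresholds `min_shared` and `min_density` are all assumptions made for illustration.

```python
from itertools import combinations


def shared_neighbor_graph(ranked, min_shared=2):
    """Link two documents when their ranked result lists overlap enough.

    `ranked` is a hypothetical input: doc id -> list of doc ids that a
    search engine returned when the document was used as a query.
    """
    neigh = {d: set(r) - {d} for d, r in ranked.items()}
    graph = {d: {} for d in ranked}
    for a, b in combinations(sorted(ranked), 2):
        shared = len(neigh[a] & neigh[b])
        if shared >= min_shared:  # assumed edge threshold
            graph[a][b] = shared
            graph[b][a] = shared
    return graph


def cluster(ranked, min_shared=2, min_density=2):
    """Return doc id -> cluster label, with -1 for unassigned documents."""
    graph = shared_neighbor_graph(ranked, min_shared)

    # Density here is simply the node degree (an assumed definition).
    core = {d for d, nbrs in graph.items() if len(nbrs) >= min_density}

    # Initial clusters: connected components among the dense (core) documents.
    labels = {}
    next_label = 0
    for seed in sorted(core):
        if seed in labels:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            cur = stack.pop()
            for nb in graph[cur]:
                if nb in core and nb not in labels:
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1

    # Refinement: each remaining document joins the cluster whose core
    # members share the most retrieved documents with it.
    for d in sorted(ranked):
        if d in labels:
            continue
        votes = {}
        for nb, weight in graph[d].items():
            if nb in core:
                votes[labels[nb]] = votes.get(labels[nb], 0) + weight
        # Ties go to the smallest label; no votes means unassigned.
        labels[d] = max(sorted(votes), key=votes.get) if votes else -1
    return labels
```

Calling `cluster(ranked)` on such a dictionary returns one integer label per document. Using node degree as the density measure and connected components as the initial clusters keeps the sketch short; other density definitions or refinement rules would fit the same outline.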
