Scalable density-based clustering with quality guarantees using random projections

Clustering offers significant insights in data analysis. Density-based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly shaped clusters. Here, we present scalable density-based clustering algorithms using random projections. Our clustering methodology achieves a speedup of two orders of magnitude compared with equivalent state-of-the-art density-based techniques, while offering analytical guarantees on the clustering quality in Euclidean space. Moreover, it introduces no difficult-to-set parameters. We provide a comprehensive analysis of our algorithms and a comparison with existing density-based algorithms.
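The abstract does not spell out the algorithms, so the following is only a minimal sketch of the primitive it names, a random projection of points onto a single line, and not the authors' method. It illustrates one fact that holds for any such projection: when the direction is a unit vector, the gap between two projected points never exceeds their Euclidean distance (by the Cauchy-Schwarz inequality), so points that are neighbours in the original space stay close on the line. All names here (`random_unit_vector`, `project`, the dimension and point count) are assumptions chosen for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere by normalising a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(point, direction):
    """Scalar projection of a point onto the given direction (a dot product)."""
    return sum(p * d for p, d in zip(point, direction))


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: 20 points in 50 dimensions (illustrative values only).
rng = random.Random(0)
dim, n = 50, 20
points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]

direction = random_unit_vector(dim, rng)
projected = [project(p, direction) for p in points]

# For every pair, compare the 1D gap after projection with the original distance.
max_ratio = max(
    abs(projected[i] - projected[j]) / euclidean(points[i], points[j])
    for i in range(n)
    for j in range(i + 1, n)
)

# The ratio is at most 1: projection onto a unit vector cannot stretch distances.
print(f"largest projected/original distance ratio: {max_ratio:.3f}")
```

The converse does not hold for a single line (distant points can land close together), which is why schemes of this kind typically combine several independent projections rather than relying on one.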
