AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets

The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with applications in many fields. However, its O(n²) time complexity remains a severe weakness. In this paper, we propose a novel anytime approach that addresses this problem by reducing both the range query time and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects according to the connected components of these primitive clusters, which reduces the label propagation time. Moreover, instead of passively performing a range query for every object as existing techniques do, AnyDBC iteratively and actively learns the current cluster structure of the data and, at each iteration, selects a small number of the most promising objects for refining the clusters. As a result, it performs substantially fewer range queries than DBSCAN while still guaranteeing the exact final result of DBSCAN. Experiments show speedups of orders of magnitude over DBSCAN and its fastest variants on very large, complex real and synthetic datasets.
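The labeling step described above, in which objects receive the label of the connected component that contains their primitive cluster, can be illustrated with a small sketch. This is not the authors' implementation: the names `DisjointSet` and `label_objects`, the input layout (primitive clusters as lists of object ids, connections as pairs of primitive-cluster indices), and the rule that an object appearing in several primitive clusters keeps the first label it receives are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class DisjointSet:
    """Union-find over primitive-cluster indices (hypothetical helper)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller index as root so labels are deterministic.
            self.parent[max(ra, rb)] = min(ra, rb)


def label_objects(
    primitive_clusters: List[List[int]],
    connections: List[Tuple[int, int]],
) -> Dict[int, int]:
    """Assign each object the label of its primitive cluster's component.

    primitive_clusters[i] lists the object ids in primitive cluster i.
    connections holds pairs (i, j) of primitive clusters assumed to be
    density-connected. Both inputs are illustrative assumptions.
    """
    ds = DisjointSet(len(primitive_clusters))
    for a, b in connections:
        ds.union(a, b)

    labels: Dict[int, int] = {}
    for idx, members in enumerate(primitive_clusters):
        root = ds.find(idx)
        for obj in members:
            # Assumption: an object shared by several primitive clusters
            # keeps the first label it was given.
            labels.setdefault(obj, root)
    return labels
```

For example, with primitive clusters `[[0, 1, 2], [3, 4], [5, 6]]` and the single connection `(0, 1)`, objects 0 through 4 share one label and objects 5 and 6 share another. Merging whole primitive clusters in this way touches each object once, rather than propagating labels object by object.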
