Improving DBSCAN's execution time by using a pruning technique on bit vectors

Clustering is the process of partitioning a set of physical or abstract objects into groups that are not known in advance. The goal is to place similar objects in the same cluster and dissimilar objects in different clusters, where similarity is evaluated from the objects' attribute values. Among the many clustering algorithms in the literature, DBSCAN is a well-known density-based one. We improve DBSCAN's execution time for binary data sets under the Hamming distance. We achieve considerable speed gains by combining a novel pruning technique with bit vectors and bitwise operations. Our method discards the distant neighbors of an object and computes distances only between the object and its possible neighbors. By discarding distant neighbors, we avoid unnecessary distance computations and use less CPU time than the conventional DBSCAN algorithm. The clusters produced by our method are identical to those of the original DBSCAN. Experiments on real and synthetic data sets demonstrate that our pruning technique yields considerably shorter execution times than DBSCAN.
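
The abstract does not spell out the pruning rule, so the following is only an illustrative sketch of how pruning on bit vectors can work, not the authors' method. It assumes each object is packed into a Python integer used as a bit vector, and it prunes with a standard lower bound: the Hamming distance between two vectors is at least the difference between their counts of set bits. The names `popcount`, `hamming`, and `eps_neighbors` are hypothetical.

```python
from typing import List, Tuple


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer used as a bit vector."""
    return bin(x).count("1")


def hamming(a: int, b: int) -> int:
    """Hamming distance via XOR followed by a bit count."""
    return popcount(a ^ b)


def eps_neighbors(
    vectors: List[int], counts: List[int], i: int, eps: int
) -> Tuple[List[int], int]:
    """Return the indices within Hamming distance eps of vectors[i],
    plus how many full distance computations were needed.

    Assumed pruning rule (not taken from the paper):
    hamming(a, b) >= |popcount(a) - popcount(b)|, so any candidate whose
    bit count differs from the query's by more than eps cannot be a neighbor.
    """
    neighbors = []
    computed = 0
    for j, v in enumerate(vectors):
        if abs(counts[i] - counts[j]) > eps:
            continue  # pruned: the lower bound already exceeds eps
        computed += 1
        if hamming(vectors[i], v) <= eps:
            neighbors.append(j)
    return neighbors, computed


# Bit counts are computed once and reused by every neighborhood query.
vectors = [0b0000, 0b0001, 0b0111, 0b1111, 0b0011]
counts = [popcount(v) for v in vectors]
neighbors, computed = eps_neighbors(vectors, counts, 0, 1)
```

With these five vectors and `eps = 1`, the query for the first vector computes two distances instead of five. Because the bound never rejects a true neighbor, the neighborhoods match those of an exhaustive scan, which is consistent with the claim that pruning leaves the clustering result unchanged.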