A Fast Approach to Clustering Datasets using DBSCAN and Pruning Algorithms

Among the various clustering algorithms, DBSCAN is an effective one used in many applications. It has several advantages: it needs no a priori assumption about the number of clusters, it can find arbitrarily shaped clusters, and it performs well even in the presence of outliers. However, its performance degrades seriously as the dataset grows large. Moreover, the choice of the two input parameters, Eps and MinPts, has a great impact on clustering performance. To address these two problems, this paper modifies the traditional DBSCAN algorithm in two ways. The first method uses a k-dimensional (k-d) tree instead of the traditional R-tree, while the second adds a locality-sensitive hashing procedure to speed up clustering and make it more efficient. Both algorithms use a k-distance graph method to calculate Eps and MinPts automatically. Experimental results show that both algorithms scale well and speed up the clustering process.
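As a rough illustration of two ingredients named in the abstract, the Python sketch below builds a small k-d tree, uses it to answer the Eps-neighbourhood (region) query that DBSCAN issues for each point, and computes the sorted k-distance values from which Eps is commonly read off at the "knee" of the curve. It is not the paper's implementation: the function names `build_kdtree`, `region_query`, and `k_distances` are hypothetical, the k-distance computation is brute force for brevity, and how the paper locates the knee or derives MinPts is not stated in the abstract and is not reproduced here.

```python
import math


def build_kdtree(points, depth=0):
    """Build a k-d tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                    # median point becomes the node
    return (pts[mid], axis,
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1))


def region_query(node, center, eps, found=None):
    """Return all stored points within Euclidean distance eps of center."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis, left, right = node
    if math.dist(point, center) <= eps:
        found.append(point)
    diff = center[axis] - point[axis]
    # Search the side that contains center first; visit the other side
    # only if the splitting plane lies within eps of center.
    near, far = (left, right) if diff < 0 else (right, left)
    region_query(near, center, eps, found)
    if abs(diff) <= eps:
        region_query(far, center, eps, found)
    return found


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted
    in descending order (the values plotted in a k-distance graph).
    Brute force, O(n^2 log n); illustration only."""
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)


# Hypothetical toy data: two small groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (10, 10)]
tree = build_kdtree(points)
neighbours = region_query(tree, (0, 0), 1.0)
curve = k_distances(points, k=1)
```

In this toy data the isolated point produces one large value at the head of `curve`, followed by a flat run of small values; a sharp drop of that kind is the sort of feature a k-distance heuristic looks for when choosing Eps.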
