An efficient and scalable density-based clustering algorithm for datasets with complex structures

Clustering is an unsupervised learning task in data mining that assigns the objects of a dataset to groups, called clusters, without any prior knowledge. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is one of the most widely used clustering algorithms for spatial datasets; it can detect clusters of arbitrary shape and automatically identifies noise points. However, DBSCAN has several limitations: (1) its performance depends on two user-specified parameters, ε and MinPts, where ε is the maximum radius of the neighborhood around a point and MinPts is the minimum number of data points that such a neighborhood must contain; (2) searching for the nearest neighbors of each object during cluster expansion is prohibitively time-consuming; (3) different starting points can produce quite different results; and (4) DBSCAN cannot identify adjacent clusters of different densities. In addition, the identification of border points is often ignored. In this paper, we address these problems. First, we improve the traditional locality-sensitive hashing method to support fast nearest-neighbor queries. Second, we redefine several concepts in terms of the influence space of each object, which takes both its nearest neighbors and its reverse nearest neighbors into account. The influence space is proved to be sensitive to local density changes, which reduces the number of parameters and makes it possible to identify adjacent clusters of different densities. Moreover, the relationship based on the influence space makes the algorithm insensitive to the order in which points are input. Finally, we propose a new concept, core density reachability, based on the influence space, to distinguish border objects from noise objects.
Experiments demonstrate that the proposed algorithm performs better than both the traditional DBSCAN algorithm and the improved algorithm IS-DBSCAN.
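
To make the notion of an influence space concrete, the following is a minimal sketch, not the paper's implementation. It assumes the influence space of an object is the intersection of its k nearest neighbors and its reverse k nearest neighbors, which is one common definition; the abstract does not state the exact definition used. The function names, the parameter k, and the brute-force neighbor search (standing in for the improved locality-sensitive hashing query) are all hypothetical choices for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding i itself.

    Brute force, O(n log n) per query; a real system would use an index
    or an approximate method such as locality-sensitive hashing.
    """
    others = (j for j in range(len(points)) if j != i)
    ranked = sorted(others, key=lambda j: euclidean(points[i], points[j]))
    return set(ranked[:k])


def influence_spaces(points, k):
    """Return one set of indices per point: its (assumed) influence space."""
    nn = [k_nearest(points, i, k) for i in range(len(points))]

    # Reverse nearest neighbors: q belongs to rnn[p] when p is one of
    # q's k nearest neighbors.
    rnn = [set() for _ in points]
    for q, neighbors in enumerate(nn):
        for p in neighbors:
            rnn[p].add(q)

    # ASSUMPTION: influence space = kNN intersected with reverse kNN.
    return [nn[p] & rnn[p] for p in range(len(points))]


if __name__ == "__main__":
    # Four points in a tight square plus one far-away point.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for idx, space in enumerate(influence_spaces(data, k=2)):
        print(idx, sorted(space))
```

In this toy dataset the four square corners each keep two mutual neighbors, while the distant point has an empty influence space because no other point counts it among its two nearest neighbors. Under the stated assumption, this asymmetry is what lets a mutual-neighbor relationship react to local density without a global radius ε.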
