Adaptive Density-Based Spatial Clustering for Massive Data Analysis

Clustering is a classical research field with broad applications in data mining, such as emotion detection, event extraction, and topic discovery. It aims to discover intrinsic patterns, in the form of clusters, in a collection of data. Significant progress has been made by Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and its variants. However, current density-based algorithms suffer from a major limitation, the linear connection problem: they fail to separate distinct clusters that are “connected” by a few data points. Moreover, parameter setting and time cost make these algorithms hard to adapt to massive data analysis. To address these problems, we propose a novel adaptive density-based spatial clustering algorithm called Ada-DBSCAN, which consists of a data block splitter and a data block merger, coordinated by local clustering and global clustering. We conduct extensive experiments on both artificial and real-world datasets to evaluate the effectiveness of Ada-DBSCAN. Experimental results show that our algorithm clearly outperforms several strong baselines in both clustering accuracy and human evaluation. In addition, Ada-DBSCAN is significantly more efficient than DBSCAN.
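The linear connection problem can be illustrated with a minimal sketch. The code below is a plain O(n^2) DBSCAN written for illustration only; it is not Ada-DBSCAN, and the toy data, `eps`, and `min_pts` values are assumptions chosen to make the effect visible. Two dense square blobs are clustered separately, then clustered again after a thin chain of five points is placed between them.

```python
from collections import deque
from math import dist


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual DBSCAN definition.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


def count_clusters(labels):
    """Number of distinct clusters, ignoring noise."""
    return len(set(labels) - {-1})


# Hypothetical toy data: two 5x5 unit grids and a one-point-wide bridge.
left = [(x, y) for x in range(0, 5) for y in range(5)]
right = [(x, y) for x in range(10, 15) for y in range(5)]
bridge = [(x, 2) for x in range(5, 10)]

EPS, MIN_PTS = 1.1, 3  # assumed values for this illustration

separate = dbscan(left + right, EPS, MIN_PTS)
connected = dbscan(left + bridge + right, EPS, MIN_PTS)

print(count_clusters(separate))   # 2: the blobs are found as distinct clusters
print(count_clusters(connected))  # 1: the bridge chains the blobs into one cluster
```

With these settings every bridge point is itself a core point, so density-reachability propagates along the chain and the two blobs collapse into a single cluster. Raising `min_pts` would break this particular bridge, but it is exactly this kind of manual parameter tuning that becomes impractical on massive data.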
