Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop

Density-based clustering for big data is critical for many modern applications, ranging from Internet data processing to massive-scale moving object management. This paper proposes the Cludoop algorithm, an efficient distributed density-based clustering method for big data using Hadoop. First, we propose a serial clustering algorithm, CluC, which leverages cell partition optimization and c-clusters to find clusters quickly. CluC classifies points using the relationships between the connected cells around them instead of expensive complete neighbor queries, which significantly reduces the number of distance calculations. Second, we propose Cludoop, which efficiently clusters very-large-scale data in parallel using the existing data partitioning of the Map/Reduce platform. It employs the proposed serial algorithm CluC as a plug-in clusterer on the parallel mappers, and transmits cell descriptions instead of complete cells to reduce both network and I/O costs. Guided by the proposed cell-based principles, we also design a three-step Merging-Refinement-Merging framework that merges c-clusters on the reducer by overlaying the assigned preclustering results. Finally, our comprehensive experimental evaluation on 10 network-connected commercial PCs, using both huge-volume real and synthetic data, demonstrates (1) the effectiveness of our algorithm in finding correct clusters of arbitrary shape and (2) that the proposed algorithm exhibits better scalability and efficiency than the state-of-the-art method.
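
To illustrate why a cell partition can reduce distance calculations, the following is a minimal sketch of a generic grid-based pruning idea, not the CluC algorithm itself. It assumes 2-D points, a neighborhood radius `eps`, and a density threshold `min_pts` in the style of DBSCAN; the cell side `eps / sqrt(2)` and the function names `partition_into_cells` and `dense_cells` are assumptions made for this illustration and are not taken from the abstract. With that cell side, the cell diagonal equals `eps`, so any two points in the same cell are within `eps` of each other, and a cell holding at least `min_pts` points can be marked dense without computing a single distance.

```python
import math
from collections import defaultdict


def partition_into_cells(points, eps):
    """Group 2-D points by grid cell (hypothetical helper, not from the paper).

    The cell side is eps / sqrt(2), so the cell diagonal is eps and any two
    points that share a cell are at most eps apart.
    """
    side = eps / math.sqrt(2)
    cells = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / side), math.floor(y / side))
        cells[key].append((x, y))
    return cells


def dense_cells(cells, min_pts):
    """Return the keys of cells holding at least min_pts points.

    Every point in such a cell has at least min_pts points (itself included)
    within eps, so no pairwise distance needs to be computed for them.
    """
    return {key for key, pts in cells.items() if len(pts) >= min_pts}


# Small usage example with made-up coordinates.
points = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (5.0, 5.0)]
cells = partition_into_cells(points, eps=1.0)
dense = dense_cells(cells, min_pts=3)
```

Here the three nearby points fall into one cell and the isolated point into another, so only the first cell is reported as dense. Points in sparse cells would still need checks against neighboring cells; how CluC handles those through connected cells and c-clusters is described in the paper, not in this sketch.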
