G2P: A Partitioning Approach for Processing DBSCAN with MapReduce

One of the most important aspects of processing large data sets is distributing and parallelizing the analysis algorithms. A distributed system performs well only if the workload is properly balanced, because the overall computing time is expected to be determined by the node whose processing takes longest. This paper proposes a data partitioning strategy that takes partition balance into account and is generic for spatial data. Our solution builds a grid model of the data and then transforms it into a graph partitioning problem, from which we compute the final partitions. We apply the approach to the distributed DBSCAN algorithm, focusing on finding dense areas in a large data set using MapReduce. We call our approach G2P (Grid and Graph Partitioning), and we show through extensive experiments that G2P produces high-quality data partitions for distributed DBSCAN compared with competing approaches. We believe that G2P is suitable not only for the DBSCAN algorithm but also for spatial join operations and distance-based range queries, to name a few.
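The grid-then-graph idea can be illustrated with a minimal sketch. It is not the G2P algorithm itself: the function names, the edge-weight definition (sum of the point counts of two adjacent cells), and the greedy heaviest-first assignment are all assumptions chosen for brevity, and the greedy step stands in for a real balanced graph partitioner, since it balances load but ignores the edge cut.

```python
from collections import Counter


def build_grid_graph(points, cell_size):
    """Bin 2-D points into square cells and link adjacent non-empty cells.

    Returns (weights, edges): weights maps a cell to its point count,
    edges maps a pair of neighbouring cells to a weight. The edge weight
    used here (sum of both counts) is an illustrative assumption.
    """
    weights = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    edges = {}
    for (cx, cy) in weights:
        # Visit only "forward" neighbours so each pair is stored once.
        for dx, dy in ((1, 0), (0, 1), (1, 1), (1, -1)):
            nb = (cx + dx, cy + dy)
            if nb in weights:
                edges[((cx, cy), nb)] = weights[(cx, cy)] + weights[nb]
    return weights, edges


def greedy_partition(weights, k):
    """Assign cells, heaviest first, to the currently lightest partition.

    A stand-in for a balanced graph partitioner (hypothetical helper).
    """
    loads = [0] * k
    assignment = {}
    for cell, w in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        target = loads.index(min(loads))
        assignment[cell] = target
        loads[target] += w
    return assignment, loads


def cut_weight(edges, assignment):
    """Total weight of edges whose endpoints fall in different partitions."""
    return sum(w for (a, b), w in edges.items() if assignment[a] != assignment[b])


points = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (1.5, 0.5), (1.6, 0.4), (2.5, 2.5)]
weights, edges = build_grid_graph(points, cell_size=1.0)
assignment, loads = greedy_partition(weights, k=2)
print(loads, cut_weight(edges, assignment))
```

In this toy input the two partitions end up with equal point counts, but the only pair of adjacent cells is split between them. A graph partitioner would weigh that cut against the balance, which is the trade-off the abstract alludes to.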
