Mr. Scan: A Hybrid/Hybrid Extreme Scale Density Based Clustering Algorithm

Density-based clustering algorithms are a widely used class of data mining techniques that can find irregularly shaped clusters and cluster data without prior knowledge of the number of clusters the data contains. DBSCAN is the best-known density-based clustering algorithm. We introduce our extension of DBSCAN, called Mr. Scan, which uses a hybrid/hybrid parallel implementation that combines the MRNet tree-based distribution network with GPU-equipped nodes. Mr. Scan avoids the problems encountered in other parallel versions of DBSCAN, such as scalability limits, reduced output quality at large scales, and the inability to process dense regions of data effectively. Mr. Scan uses effective data partitioning and a new merging technique so that data sets can be broken into independently processable partitions without the loss of quality or the large amount of node-to-node communication seen in other parallel versions of DBSCAN. The dense box algorithm designed as part of Mr. Scan allows dense regions to be detected and clustered without individually comparing all points in these regions to one another. Mr. Scan was tested on both a geolocated Twitter dataset and image data obtained from the Sloan Digital Sky Survey. In testing Mr. Scan, we performed end-to-end benchmarks that measure complete application run time, from reading raw, unordered input point data from the file system to writing the final clustered output to the file system. End-to-end benchmarking gives a clear picture of the performance that can be expected from Mr. Scan in real-world use cases. At its largest scale, Mr. Scan clustered 6.5 billion points from the Twitter dataset on 8,192 GPU nodes on Cray Titan in 7.5 minutes.
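The idea behind clustering a dense region without pairwise comparisons can be illustrated with a small sketch. This is not the Mr. Scan implementation; it is a simplified 2D illustration that assumes a uniform grid of square cells whose diagonal equals the DBSCAN radius. The names `find_dense_cells`, `eps`, and `min_pts` are hypothetical. Because any two points in such a cell are less than `eps` apart, a cell holding at least `min_pts` points contains only core points, and all of them belong to one cluster.

```python
import math
from collections import defaultdict


def find_dense_cells(points, eps, min_pts):
    """Return grid cells that can be clustered without pairwise checks.

    Assumption (illustrative, not taken from the paper): cells are
    squares with side eps / sqrt(2), so the cell diagonal equals eps.
    """
    side = eps / math.sqrt(2)
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        # Map each point to the integer coordinates of its grid cell.
        key = (math.floor(x / side), math.floor(y / side))
        cells[key].append(idx)
    # Every point in a cell with >= min_pts members has at least
    # min_pts points (itself included) within eps, so it is a core
    # point, and all members are mutually density-reachable.
    return {key: members for key, members in cells.items()
            if len(members) >= min_pts}


points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0)]
dense = find_dense_cells(points, eps=1.0, min_pts=3)
```

In this example the first three points share one cell and are marked as a dense group, while the isolated fourth point is left for the ordinary DBSCAN neighborhood search.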
