Scalable parallel OPTICS data clustering using graph algorithmic techniques

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging because the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve a high degree of parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to that of the results given by the classical OPTICS algorithm.
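To illustrate the role of the disjoint-set data structure, the following is a minimal sequential sketch in Python, not the POPTICS implementation itself. It assumes that a spanning tree over the points is already available as `(u, v, weight)` edges whose weights stand in for reachability distances. The names `DisjointSet`, `extract_clusters`, and `threshold` are hypothetical, and treating singleton components as noise is a simplification made for this example.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find with union by rank and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def extract_clusters(num_points, tree_edges, threshold):
    """Merge the endpoints of every edge whose weight is <= threshold.

    Returns (clusters, noise): clusters is a sorted list of sorted point
    lists; noise lists points left in singleton components (an assumption
    of this sketch, not the paper's noise definition).
    """
    ds = DisjointSet(num_points)
    for u, v, w in tree_edges:
        if w <= threshold:
            ds.union(u, v)

    groups = defaultdict(list)
    for p in range(num_points):
        groups[ds.find(p)].append(p)

    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    noise = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, noise
```

Because each edge is tested independently and unions can be applied in any order, the edge list can in principle be partitioned across threads or processes, which is the property that makes a disjoint-set formulation attractive for parallel cluster extraction. Rerunning `extract_clusters` with a different `threshold` on the same edges yields a different clustering without recomputing the tree.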
