Hierarchical Density-Based Clustering based on GPU Accelerated Data Indexing Strategy

Due to the recent increase in the volume of generated data, organizing this data has become one of the biggest problems in Computer Science. Among the strategies proposed to handle this task efficiently and effectively, we highlight clustering and, more specifically, density-based clustering algorithms such as DBSCAN and OPTICS, which stand out for their ability to find clusters of arbitrary shape and for their robustness to noise. However, these algorithms remain computationally challenging because they are distance-based. In this work we present a new approach that makes OPTICS feasible by means of a data indexing strategy. Although the index is simple (the data are indexed using graphs), it exposes several parallelization opportunities, which we exploit using a graphics processing unit (GPU). Based on this structure, the complexity of OPTICS is reduced to O(E log V) in the worst case, which makes it very fast. In our evaluation we show that our proposal can be over 200x faster than its sequential CPU version.
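To make the O(E log V) bound concrete, the following is a minimal CPU-only sketch, not the authors' GPU implementation. It assumes that V is the number of points and E the number of pairs within distance eps of each other, and that the index is a plain adjacency list; the function names (`build_graph`, `core_distance`, `optics_order`) and the convention that `min_pts` counts neighbors excluding the point itself are hypothetical choices made for illustration. Once the graph exists, OPTICS only relaxes edges through a binary heap, so the ordering step costs O(E log V). The brute-force graph construction here is O(V^2) and stands in for the indexing step that the abstract describes as GPU-accelerated.

```python
import heapq
import math


def build_graph(points, eps):
    """Index the data as a graph: neighbors[i] lists (j, dist) for every j within eps of i.

    Brute force, O(V^2); a stand-in for the indexing step (assumption).
    """
    n = len(points)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d <= eps:
                neighbors[i].append((j, d))
                neighbors[j].append((i, d))
    return neighbors


def core_distance(adjacent, min_pts):
    """Distance to the min_pts-th nearest neighbor, or infinity if there are too few.

    Assumption: min_pts >= 1 and counts neighbors excluding the point itself.
    """
    if len(adjacent) < min_pts:
        return math.inf
    return sorted(d for _, d in adjacent)[min_pts - 1]


def optics_order(points, eps, min_pts):
    """Return the OPTICS processing order and the reachability distance of each point."""
    neighbors = build_graph(points, eps)
    n = len(points)
    core = [core_distance(neighbors[i], min_pts) for i in range(n)]
    reach = [math.inf] * n
    processed = [False] * n
    order = []
    for start in range(n):
        if processed[start]:
            continue
        heap = [(math.inf, start)]
        while heap:
            _, p = heapq.heappop(heap)
            if processed[p]:
                continue  # stale heap entry; a smaller key was already popped
            processed[p] = True
            order.append(p)
            if core[p] == math.inf:
                continue  # not a core point, so it does not expand the cluster
            for q, d in neighbors[p]:
                if processed[q]:
                    continue
                new_reach = max(core[p], d)
                if new_reach < reach[q]:
                    reach[q] = new_reach
                    # At most one push per edge relaxation: O(E log V) overall.
                    heapq.heappush(heap, (new_reach, q))
    return order, reach
```

Calling `optics_order(points, eps, min_pts)` on a list of coordinate tuples returns the cluster ordering together with the reachability values; points whose reachability stays infinite start a new cluster or are noise. The sketch uses lazy deletion of stale heap entries instead of a decrease-key operation, which keeps the code short while preserving the logarithmic cost per edge.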
