GPUMAFIA: Efficient Subspace Clustering with MAFIA on GPUs

Clustering, i.e., the identification of regions of similar objects in a multi-dimensional data set, is a standard method of data analytics with a large variety of applications. For high-dimensional data, subspace clustering can be used to find clusters within a subset of the data dimensions, which alleviates the curse of dimensionality. In this paper we focus on the MAFIA subspace clustering algorithm and on using GPUs to accelerate it. We first present a number of algorithmic changes and estimate their effect on the computational complexity of the algorithm. These changes reduce that complexity and accelerate the sequential version by one to two orders of magnitude on practical datasets while producing exactly the same output. We then present the GPU version of the algorithm, which for typical datasets provides a further speedup of one to two orders of magnitude over a single CPU core, or about one order of magnitude over a typical multi-core CPU. We believe that our faster implementation widens the applicability of MAFIA and of subspace clustering.
