MR-Mafia: Parallel Subspace Clustering Algorithm Based on MapReduce for Large Multi-dimensional Datasets

The goal of subspace clustering is to find hidden clusters that exist in different subspaces of a dataset. In recent years, with the exponential growth of data size and dimensionality, traditional subspace clustering algorithms have become both inefficient and ineffective at extracting knowledge in big data environments. This creates an urgent need for efficient parallel, distributed subspace clustering algorithms that can handle large multi-dimensional data at an acceptable computational cost. In this paper, we introduce MR-Mafia, a parallel MAFIA subspace clustering algorithm based on MapReduce. The algorithm takes advantage of MapReduce's data partitioning and task parallelism and achieves a good tradeoff between disk access cost and communication cost. The experimental results show near-linear speedup and demonstrate the high scalability and strong application prospects of the proposed algorithm.
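The abstract does not spell out the individual MapReduce jobs, so the following is only a minimal sketch of how data partitioning and task parallelism might be applied to one step that grid-based subspace clustering typically needs: counting, per dimension, how many points fall into each bin. It assumes fixed-width bins over a known value range and simulates the map and reduce phases in plain Python. The function names (`map_partition`, `reduce_counts`, `histogram`) and parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter
from functools import reduce


def map_partition(points, n_bins, lo, hi):
    """Map step (assumed): count, for one data partition, how many
    points fall into each (dimension, bin) cell of a fixed-width grid."""
    counts = Counter()
    width = (hi - lo) / n_bins
    for point in points:
        for dim, x in enumerate(point):
            # Clamp so that x == hi lands in the last bin.
            b = min(int((x - lo) / width), n_bins - 1)
            counts[(dim, b)] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step (assumed): merge partial counts by summing per key."""
    return a + b


def histogram(partitions, n_bins=4, lo=0.0, hi=1.0):
    """Run the map step on every partition, then reduce the partial results.
    In a real MapReduce job the map calls would run in parallel on
    different nodes; here they run sequentially for illustration."""
    partial = [map_partition(p, n_bins, lo, hi) for p in partitions]
    return reduce(reduce_counts, partial, Counter())


if __name__ == "__main__":
    # Two partitions of 2-dimensional points with values in [0, 1].
    parts = [[(0.1, 0.9), (0.2, 0.8)], [(0.15, 0.1)]]
    for key, count in sorted(histogram(parts).items()):
        print(key, count)
```

Each mapper only sends its local counts to the reducer instead of the raw points, which is one way a design like this could keep communication cost low relative to the cost of scanning the data from disk.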
