A Survey of Parallel Clustering Algorithms Based on Spark

Clustering is one of the most important unsupervised machine learning tasks and is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, classical clustering algorithms can no longer meet the requirements of clustering big data. Spark is one of the most popular parallel processing platforms for big data, and researchers have proposed many parallel clustering algorithms based on it. This paper classifies and summarizes the existing Spark-based parallel clustering algorithms, discusses the parallel design framework of each class of algorithm, and, after comparing the different classes, outlines directions for future research.
