A three-way cluster ensemble approach for large-scale data

Abstract: Cluster ensembles have emerged as a powerful technique for combining multiple clustering results. To address clustering on large-scale data, this paper presents an efficient three-way cluster ensemble approach based on Spark that handles both hard and soft clustering. First, inspired by the theory of three-way decisions, we propose a Spark-based framework for three-way cluster ensembles and develop a distributed three-way k-means clustering algorithm. Then, we introduce the concept of a cluster unit, which reflects the minimal-granularity distribution structure agreed upon by all ensemble members, together with quantitative measures of the relationships between units and between clusters. Finally, we propose a consensus clustering algorithm based on cluster units and devise several three-way decision strategies for assigning small cluster units and no-unit objects. Experiments on 19 real-world data sets validate the effectiveness of the approach in terms of ARI, ACC, NMI, and F1-measure. The results show that the approach deals effectively with large-scale data and that the proposed consensus clustering algorithm lowers the time cost without sacrificing clustering quality.
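To make the notion of a cluster unit more concrete, the following is a minimal single-machine sketch, not the paper's Spark implementation. It assumes hard label vectors and reads "agreed by all the ensemble members" as "objects that receive the same label in every member"; the function names `cluster_units` and `split_by_size` and the `min_size` threshold are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def cluster_units(partitions):
    """Group object indices that share the same label in every ensemble member.

    `partitions` is a list of label lists, one per ensemble member.
    This is an illustrative reading of "cluster unit", not the paper's definition.
    """
    n = len(partitions[0])
    if any(len(p) != n for p in partitions):
        raise ValueError("all partitions must label the same objects")
    units = defaultdict(list)
    for i in range(n):
        # The tuple of labels across members identifies the unit of object i.
        signature = tuple(p[i] for p in partitions)
        units[signature].append(i)
    return dict(units)


def split_by_size(units, min_size):
    """Separate units into large ones and small ones (hypothetical threshold)."""
    large = {s: m for s, m in units.items() if len(m) >= min_size}
    small = {s: m for s, m in units.items() if len(m) < min_size}
    return large, small


# Three ensemble members labelling six objects.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
units = cluster_units(partitions)
large, small = split_by_size(units, min_size=2)
```

Here objects 0 and 1 form one unit and objects 3, 4, and 5 form another, while object 2 is left in a unit of its own because the second member disagrees with the other two about it. In the approach described above, such small units would be the ones assigned later by a three-way decision strategy.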
