Fuzzy Consensus Clustering With Applications on Big Data

Consensus clustering aims to find a single partition of data that agrees as much as possible with existing basic partitions. Given its robustness and generalizability, consensus clustering has emerged as a promising solution for finding cluster structures inside heterogeneous big data arising from various application domains. In the area of fuzzy systems, however, research along this line is still in its initial stage, with only some unsystematic algorithmic studies. Finding a fuzzy consensus partition from multiple fuzzy basic partitions in an efficient, flexible, and robust way remains an open problem calling for further investigation. In light of this, this paper provides a systematic study of fuzzy consensus clustering (FCC) from a utility perspective. Specifically, we first define the objective function of FCC using a novel fuzzified contingency matrix. We then derive a family of FCC utility functions, termed FCCU, that can transform FCC into a weighted piecewise fuzzy $c$-means clustering (piFCM) problem. This allows us to establish an algorithmic framework for FCC with a flexible choice of utility functions, and it speeds up FCC significantly through the FCM-like iterative process of piFCM. To meet the big data challenge, we further parallelize FCC on the Spark platform with both vertical and horizontal segmentation schemes. Extensive experiments on various real-world datasets demonstrate the excellent performance of FCC, even when a majority of the basic partitions are poor. In particular, our method shows promising potential for big data clustering in two real-life applications: online event detection and overlapping community detection.
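To make the "FCM-like iterative process" concrete, the following is a minimal sketch of a simplified stand-in, not the piFCM algorithm or any FCCU utility from the paper. It assumes that each fuzzy basic partition is given as a membership matrix, that the matrices are concatenated column-wise, and that standard fuzzy $c$-means with squared Euclidean distance is run on the result. The function name `fuzzy_c_means`, its parameters, and the toy matrices `P1` and `P2` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means with squared Euclidean distance (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial membership matrix; each row sums to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centroid update: means of the rows of X weighted by u^m.
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distances from every object to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centroids

# Two hypothetical fuzzy basic partitions of six objects into two clusters.
# P2 describes the same grouping as P1 but with the cluster columns swapped.
P1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
               [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
P2 = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7],
               [0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])

# Assumption: basic partitions are combined by column-wise concatenation.
X = np.hstack([P1, P2])
U, V = fuzzy_c_means(X, c=2)
labels = U.argmax(axis=1)  # hardened view of the fuzzy consensus memberships
```

Because the concatenated rows carry each object's memberships across all basic partitions, the column order within any single partition does not matter here. The paper's piFCM additionally weights the partitions and handles them piecewise, which this sketch does not attempt.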
