Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond

The rapid emergence of high-dimensional data in various areas has brought new challenges to current ensemble clustering research. To deal with the curse of dimensionality, considerable efforts in ensemble clustering have recently been made by means of different subspace-based techniques. However, beyond the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains an open problem in ensemble clustering how to create and aggregate a large population of diversified metrics and, furthermore, how to jointly investigate the multilevel diversity in large populations of metrics, subspaces, and clusters within a unified framework. To tackle this problem, this article proposes a novel multidiversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, and these metrics are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings is constructed. Furthermore, an entropy-based criterion is used to explore the cluster-wise diversity in ensembles, based on which three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments are conducted on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, which demonstrate the superiority of our algorithms over the state of the art. The source code is available at https://github.com/huangdonghere/MDEC.
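As a rough illustration of the metric-subspace pairing described above, the following Python sketch builds several similarity matrices, each from a random feature subset and a scaled exponential similarity kernel with randomly drawn hyperparameters. It is not taken from the released MDEC code: the function names, the sampling ranges for the neighborhood size `k` and the scale `mu`, and the subspace ratio of 0.5 are all assumptions chosen for illustration, and the kernel form follows the commonly used definition from similarity network fusion.

```python
import numpy as np


def random_subspace(n_features, ratio, rng):
    """Pick a random subset of feature indices (hypothetical helper)."""
    size = max(1, int(round(ratio * n_features)))
    return rng.choice(n_features, size=size, replace=False)


def scaled_exp_similarity(X, k, mu):
    """Scaled exponential similarity kernel on the rows of X.

    W[i, j] = exp(-d(i, j)^2 / (mu * eps[i, j])), where eps[i, j] averages
    d(i, j) with the mean distances from i and from j to their k nearest
    neighbors.
    """
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)          # squared Euclidean distances
    d = np.sqrt(d2)
    # Column 0 of each sorted row is the zero self-distance, so skip it.
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + d) / 3.0
    eps = np.maximum(eps, np.finfo(float).eps)  # guard against division by zero
    return np.exp(-d2 / (mu * eps))


def metric_subspace_similarities(X, n_pairs, ratio=0.5, seed=0):
    """Build one similarity matrix per random metric-subspace pair.

    The ranges used for k and mu below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    matrices = []
    for _ in range(n_pairs):
        cols = random_subspace(p, ratio, rng)    # random subspace
        k = int(rng.integers(2, max(3, n // 2)))  # random neighborhood size
        mu = rng.uniform(0.3, 0.8)                # random kernel scale
        matrices.append(scaled_exp_similarity(X[:, cols], k, mu))
    return matrices


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(20, 10))
    sims = metric_subspace_similarities(data, n_pairs=5)
    print(len(sims), sims[0].shape)
```

Each resulting matrix could then be passed to a similarity-based clusterer (for example, spectral clustering) to obtain one base clustering, so that the ensemble inherits diversity from both the metrics and the subspaces.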
