Coupled Clustering Ensemble by Exploring Data Interdependence

A clustering ensemble combines multiple partitions of a dataset into a single clustering solution and is an effective technique for improving the quality of clustering results. Current clustering ensemble algorithms are usually built on pairwise agreements at one of three levels: between clusterings, where consensus functions focus on the similarity of partitions; between data objects, where similarity measures are induced from the partitions and the objects are re-clustered; and between clusters, where groups of clusters are collapsed into meta-clusters. Most of these models make a strong IIDness assumption (i.e., independent and identically distributed), under which base clusterings perform independently of one another and all objects are also independent. In the real world, however, objects are generally related to each other through features that are either explicit or implicit. There is also a latent but definite relationship among intermediate base clusterings, because they are derived from the same set of data. Both observations call for further investigation of clustering ensembles that explore the interdependence characteristics of data. To address this problem, this article proposes a new coupled clustering ensemble (CCE) framework that works on the interdependent nature of objects and intermediate base clusterings. The main idea is to model the coupling relationship between objects by aggregating the similarity of base clusterings, and the interactive relationship among objects by addressing their neighborhood domains. Once these interdependence relationships are discovered, they act as critical supplements to clustering ensembles. We verify the proposed framework with three types of consensus function: clustering-based, object-based, and cluster-based. Substantial experiments on multiple synthetic and real-life benchmark datasets indicate that CCE effectively captures the implicit interdependence relationships among base clusterings and among objects, and achieves higher clustering accuracy, stability, and robustness than 14 state-of-the-art techniques, as supported by statistical analysis. In addition, a sensitivity analysis shows that the final clustering quality depends on the characteristics (e.g., quality and consistency) of the base clusterings. Finally, applications to document clustering and to datasets of much larger size and dimensionality further demonstrate the effectiveness, efficiency, and scalability of the proposed models.
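
As a point of reference for the object-based view mentioned above, in which a similarity between data objects is induced from the partitions, the following is a minimal sketch of a plain co-association matrix. It is not the CCE method: it treats base clusterings and objects as independent, which is exactly the assumption the article questions. The function name co_association and the toy label arrays are illustrative assumptions, not taken from the article.

```python
import numpy as np

def co_association(base_labels):
    """Return the fraction of base clusterings in which each pair of
    objects is assigned to the same cluster (hypothetical helper)."""
    labels = np.asarray(base_labels)  # shape: (n_clusterings, n_objects)
    # same[b, i, j] is True when clustering b puts objects i and j together.
    same = labels[:, :, None] == labels[:, None, :]
    # Average over clusterings to get an object-by-object similarity matrix.
    return same.mean(axis=0)

# Three toy base clusterings of five objects (made-up labels).
base = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
S = co_association(base)
```

The resulting matrix S is symmetric with ones on the diagonal, and it could be passed to any similarity-based clustering algorithm to re-cluster the objects into a consensus partition.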
