Ensemble clustering using semidefinite programming with applications

In this paper, we study the ensemble clustering problem, where the input consists of multiple clustering solutions. The goal of an ensemble clustering algorithm is to aggregate these solutions into a single solution that maximizes agreement with the input ensemble. We obtain several new results for this problem. Specifically, we show that the notion of agreement in this setting is better captured by a 2D string encoding than by a voting strategy, which is common among existing approaches. Our optimization proceeds by first constructing a non-linear objective function, which is then transformed into a 0-1 semidefinite program (SDP) using novel convexification techniques. This model can subsequently be relaxed to a polynomial-time solvable SDP. In addition to the theoretical contributions, our experimental results on standard machine learning and synthetic datasets show that this approach leads to improvements not only in terms of the proposed agreement measure but also in terms of existing agreement measures based on voting strategies. We also identify several new application scenarios for this problem, including combining multiple image segmentations and generating tissue maps from multiple-channel diffusion tensor brain images to identify the underlying structure of the brain.
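For orientation, the voting strategy that the abstract contrasts against can be sketched as follows. This is a minimal illustration of a generic co-association (pairwise voting) baseline, not the 2D string encoding or the SDP formulation proposed in the paper. The function names `co_association` and `majority_consensus`, the 0.5 threshold, and the toy ensemble are assumptions made for this example.

```python
def co_association(ensemble):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    input clusterings that place items i and j in the same cluster."""
    m = len(ensemble)
    n = len(ensemble[0])
    counts = [[0] * n for _ in range(n)]
    for labels in ensemble:
        if len(labels) != n:
            raise ValueError("all clusterings must label the same items")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def majority_consensus(ensemble, threshold=0.5):
    """Link every pair whose co-association exceeds the threshold and
    return the connected components as consensus cluster labels."""
    co = co_association(ensemble)
    n = len(co)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(i)] = find(j)

    # Relabel components in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy ensemble (assumed data): three clusterings of five items.
# Cluster ids are arbitrary, so only co-membership matters.
ensemble = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
consensus = majority_consensus(ensemble)
```

Because the sketch compares only whether two items share a label within each clustering, it is unaffected by how the individual clusterings number their clusters. It also shows the kind of pairwise vote counting that the abstract argues captures agreement less well than the proposed encoding.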
