BoostCluster: boosting clustering by pairwise constraints

Data clustering is an important task in many disciplines. A large number of studies have attempted to improve clustering by using side information, which is often encoded as pairwise constraints. However, these studies focus on designing special-purpose clustering algorithms that can effectively exploit the pairwise constraints. We present a boosting framework for data clustering, termed BoostCluster, that iteratively improves the accuracy of any given clustering algorithm by exploiting the pairwise constraints. The key challenge in designing a boosting framework for data clustering is how to influence an arbitrary clustering algorithm with the side information, since clustering algorithms are by definition unsupervised. The proposed framework addresses this problem by dynamically generating a new data representation at each iteration that is, on the one hand, adapted to the clustering results produced by the given algorithm at previous iterations and, on the other hand, consistent with the given side information. Our empirical study shows that the proposed boosting framework is effective in improving the performance of a number of popular clustering algorithms (K-means, partitional SingleLink, spectral clustering), and that its performance is comparable to that of state-of-the-art algorithms for data clustering with side information.
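To make the idea of regenerating the data representation at each iteration more concrete, the following Python sketch re-projects the data using the pairwise constraints that the previous clustering violated, then re-runs a black-box clusterer on the projected data. It is a loose illustration under stated assumptions, not the update rule of BoostCluster: the names `kmeans`, `boost_with_constraints`, `n_rounds`, and `n_dims` are hypothetical, the eigenvector-based projection is one simple choice among many, and the sketch omits any weighting or combination of the per-round clusterings.

```python
import numpy as np

def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; stands in for 'any given clustering algorithm'."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

def boost_with_constraints(X, k, must_link, cannot_link,
                           base=kmeans, n_rounds=10, n_dims=2):
    """Hypothetical loop: re-represent the data from violated constraints."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)            # center the features
    n, d = X.shape
    labels = base(X, k)               # unsupervised starting point
    for _ in range(n_rounds):
        same = labels[:, None] == labels[None, :]
        # T marks the constraints the current clustering violates.
        T = np.zeros((n, n))
        for i, j in must_link:
            if not same[i, j]:
                T[i, j] = T[j, i] = 1.0    # should be pulled together
        for i, j in cannot_link:
            if same[i, j]:
                T[i, j] = T[j, i] = -1.0   # should be pushed apart
        if not T.any():
            break                     # every constraint is satisfied
        # Assumed projection: top eigenvectors of the symmetric matrix X^T T X.
        w, V = np.linalg.eigh(X.T @ T @ X)
        P = V[:, np.argsort(w)[::-1][:min(n_dims, d)]]
        # The base algorithm only ever sees the new representation.
        labels = base(X @ P, k)
    return labels
```

The base clusterer is passed in as a function and never receives the constraints directly, which mirrors the abstract's point that the side information has to reach an unsupervised algorithm through the data representation alone.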
