An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering

Minimum sum-of-squares clustering (MSSC), also known as k-means-type clustering, is traditionally considered an unsupervised learning task. In recent years, using background knowledge to improve cluster quality and make the clustering process more interpretable has become an active research topic at the intersection of mathematical optimization and machine learning. The problem of exploiting background information in data clustering is called semi-supervised or constrained clustering. In this paper, we present a new branch-and-bound algorithm for semi-supervised MSSC, where background knowledge is incorporated as pairwise must-link and cannot-link constraints. For the lower bound, we solve the semidefinite programming relaxation of the MSSC discrete optimization model and strengthen the bound with a cutting-plane procedure. For the upper bound, we propose an adaptation of the k-means algorithm to the constrained case that relies on integer programming tools. For the first time, the proposed global optimization algorithm efficiently solves real-world instances with up to 800 data points, with different combinations of must-link and cannot-link constraints and a generic number of features. This problem size is about four times larger than that of the instances solved by state-of-the-art exact algorithms.
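
To make the problem statement concrete, the following minimal sketch evaluates the MSSC objective of a given cluster assignment and checks whether that assignment respects pairwise must-link and cannot-link constraints. It is only an illustration of the quantities involved, not the branch-and-bound algorithm of the paper; the function names `mssc_objective` and `satisfies_constraints` and the toy data are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def mssc_objective(points: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    dim = len(points[0])
    sums: List[List[float]] = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    # Accumulate per-cluster coordinate sums and sizes.
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    # Centroid of each non-empty cluster (empty clusters are never looked up below).
    centroids = [
        [s / counts[c] for s in sums[c]] if counts[c] else [0.0] * dim
        for c in range(k)
    ]
    return sum(
        sum((p[d] - centroids[c][d]) ** 2 for d in range(dim))
        for p, c in zip(points, labels)
    )


def satisfies_constraints(labels: Sequence[int], must_link: Sequence[Pair], cannot_link: Sequence[Pair]) -> bool:
    """True if every must-link pair shares a cluster and every cannot-link pair does not."""
    return (
        all(labels[i] == labels[j] for i, j in must_link)
        and all(labels[i] != labels[j] for i, j in cannot_link)
    )


# Hypothetical toy instance: four points in the plane, two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
must_link = [(0, 1)]    # points 0 and 1 must share a cluster
cannot_link = [(0, 2)]  # points 0 and 2 must be separated

good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
print(mssc_objective(points, good, 2), satisfies_constraints(good, must_link, cannot_link))
print(mssc_objective(points, bad, 2), satisfies_constraints(bad, must_link, cannot_link))
```

In this toy instance the first assignment is feasible and has the smaller objective, while the second violates the must-link pair. An exact method has to certify that no feasible assignment achieves a lower objective, which is where lower and upper bounds come in.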
