Semi-supervised model-based clustering with positive and negative constraints

Cluster analysis is a popular technique in statistics and computer science whose objective is to group similar observations into relatively distinct groups, generally known as clusters. Semi-supervised clustering assumes that some additional information about group memberships is available. Under the most frequently considered scenario, labels are known for some portion of the data and unavailable for the remaining observations. In this paper, we discuss a general type of semi-supervised clustering defined by so-called positive and negative constraints. Under positive constraints, some data points are required to belong to the same cluster. In contrast, negative constraints specify that particular points must belong to different clusters. We outline a general framework for semi-supervised clustering with constraints that naturally incorporates the additional information into the EM algorithm traditionally used in mixture modeling and model-based clustering. The developed methodology is illustrated on synthetic and classification datasets. A dendrochronology application is considered and thoroughly discussed.
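To make the role of the two constraint types concrete, the following is a minimal sketch of how they could enter the E-step of an EM algorithm for a finite mixture. It is an illustration under stated assumptions, not the paper's derivation: points joined by positive constraints are treated as a block sharing one latent label whose prior weight is the mixing proportion, and a negative constraint between two blocks is handled by zeroing the joint probability of assigning both to the same component. The function names and the toy numbers are hypothetical.

```python
import math

def block_log_weights(log_dens, blocks, log_pi):
    """Unnormalized log-weight of every block under every component.

    log_dens[i][k] is log f_k(x_i); blocks is a list of index lists whose
    points must share a label (positive constraints); log_pi[k] is the log
    mixing proportion. Assumption: one prior weight per block.
    """
    K = len(log_pi)
    return [[log_pi[k] + sum(log_dens[i][k] for i in block) for k in range(K)]
            for block in blocks]

def positive_posterior(log_w):
    """Posterior of one block when no negative constraint touches it."""
    shift = max(log_w)                      # guard against underflow
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def negative_pair_posteriors(log_wa, log_wb):
    """Marginal posteriors of two blocks that must lie in different components.

    Only assignments (k, l) with k != l are kept, so K >= 2 is required.
    """
    K = len(log_wa)
    logs = {(k, l): log_wa[k] + log_wb[l]
            for k in range(K) for l in range(K) if k != l}
    shift = max(logs.values())
    joint = {kl: math.exp(v - shift) for kl, v in logs.items()}
    total = sum(joint.values())
    post_a = [sum(p for (k, _), p in joint.items() if k == j) / total
              for j in range(K)]
    post_b = [sum(p for (_, l), p in joint.items() if l == j) / total
              for j in range(K)]
    return post_a, post_b

# Toy data (hypothetical): three points, two components.
dens = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
log_dens = [[math.log(v) for v in row] for row in dens]
log_pi = [math.log(0.5), math.log(0.5)]

# Points 0 and 1 are positively constrained; point 2 forms its own block.
log_w = block_log_weights(log_dens, [[0, 1], [2]], log_pi)
free_posteriors = [positive_posterior(row) for row in log_w]

# Additionally require the two blocks to lie in different components.
post_a, post_b = negative_pair_posteriors(log_w[0], log_w[1])
```

In an M-step, every point in a block would reuse its block's posterior as its responsibility. Enumerating joint assignments as above is only practical for a few mutually constrained blocks, since the number of admissible assignments grows quickly with the number of blocks and components.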
