Classtering: Joint Classification and Clustering with Mixture of Factor Analysers

In this work we propose a novel parametric Bayesian model for semi-supervised classification and clustering. Standard approaches to semi-supervised classification can recognize classes but cannot find groups within the data. Semi-supervised clustering techniques, on the other hand, can discover groups of data but cannot find the associations between clusters and classes. The proposed model classifies and clusters samples simultaneously, allowing the analysis of data in the presence of an unknown number of classes and/or an arbitrary number of clusters per class. Experiments on synthetic and real-world data show that the proposed model compares favourably to state-of-the-art approaches for semi-supervised clustering and that the discovered clusters can help to enhance classification performance, even in cases where the cluster and low-density separation assumptions do not hold. Finally, we show that, when applied to a challenging real-world problem of subgroup discovery in breast cancer, the method is capable of maximally exploiting the limited information available and identifying highly promising subgroups.
