Simultaneous Gaussian model-based clustering for samples of multiple origins

Gaussian mixture model-based clustering is now a standard tool for estimating a hypothetical underlying partition of a single dataset. In this paper, we aim to cluster several datasets simultaneously in a context where the underlying populations, although different, are not completely unrelated: all individuals are described by the same features, and partitions with identical meaning are expected. Using natural arguments to justify a stochastic linear link between the components of the mixtures associated with each dataset, we propose parsimonious and meaningful models for a so-called simultaneous clustering method. Maximum likelihood estimates of the mixture parameters, subject to the linear link constraint, can be easily computed by a Generalized Expectation Maximization algorithm that we describe. Promising results are obtained in a biological context, where simultaneous clustering outperforms independent clustering for partitioning three different subspecies of birds. Further results on ornithological data show that the proposed strategy is robust to relaxing the exact descriptor concordance, which is one of its main assumptions.
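
To make the idea of a linear link between mixture components concrete, the following is a minimal sketch and not the estimation procedure of the paper. It assumes, purely for illustration, that a Gaussian component of one dataset is mapped to the matching component of another dataset by an affine transformation x -> D x + b with D diagonal. The function name `link_component` and all numerical values are hypothetical. Under this assumption the component parameters transform as mean -> D mean + b and covariance -> D covariance D^T, which is the standard rule for an affine map of a Gaussian vector.

```python
import numpy as np


def link_component(mu, sigma, D, b):
    """Push Gaussian parameters (mu, sigma) through the affine map x -> D x + b.

    Hypothetical helper: D and b stand for an assumed link between the
    k-th component of one dataset and the k-th component of another.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    D = np.asarray(D, dtype=float)
    b = np.asarray(b, dtype=float)
    # Mean and covariance of D X + b when X ~ N(mu, sigma)
    return D @ mu + b, D @ sigma @ D.T


# Illustrative (made-up) parameters of one component in a first dataset
mu1 = [1.0, 2.0]
sigma1 = [[2.0, 0.5], [0.5, 1.0]]

# Assumed link: per-feature rescaling (diagonal D) plus a shift b
D = np.diag([2.0, 3.0])
b = [0.5, -1.0]

mu2, sigma2 = link_component(mu1, sigma1, D, b)
print(mu2)     # linked mean
print(sigma2)  # linked covariance
```

With a diagonal D, each feature of the linked component is a rescaled and shifted version of the same feature in the original component, so only the link parameters need to be estimated in addition to one set of component parameters.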
