Challenges with EM in application to weakly identifiable mixture models

We study a class of weakly identifiable location-scale mixture models for which the maximum likelihood estimates based on $n$ i.i.d. samples are known to have lower accuracy than the classical $n^{-\frac{1}{2}}$ error rate. We investigate whether the Expectation-Maximization (EM) algorithm also converges slowly for these models. We first demonstrate via simulation studies a broad range of over-specified mixture models for which the EM algorithm converges very slowly, both in one and higher dimensions. We then provide a complete analytical characterization of this behavior for fitting data generated from a multivariate standard normal distribution using a two-component Gaussian mixture with varying location and scale parameters. Our results reveal distinct regimes in the convergence behavior of EM as a function of the dimension $d$. In the multivariate setting ($d \geq 2$), when the covariance matrix is constrained to be a multiple of the identity matrix, the EM algorithm converges in order $(n/d)^{\frac{1}{2}}$ steps and returns estimates that are at a Euclidean distance of order $(n/d)^{-\frac{1}{4}}$ and $(nd)^{-\frac{1}{2}}$ from the true location and scale parameters, respectively. In contrast, in the univariate setting ($d = 1$), the EM algorithm converges in order $n^{\frac{3}{4}}$ steps and returns estimates that are at a Euclidean distance of order $n^{-\frac{1}{8}}$ and $n^{-\frac{1}{4}}$ from the true location and scale parameters, respectively. Establishing the slow rates in the univariate setting requires a novel two-stage localization argument, in which each stage involves an epoch-based argument applied to a different surrogate EM operator at the population level. We also exhibit multivariate ($d \geq 2$) examples, involving more general covariance matrices, for which EM displays the same slow rates as in the univariate case.
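For concreteness, the following is a minimal NumPy sketch of the over-specified fit described above: standard normal data fit with a symmetric two-component mixture $\frac{1}{2}\mathcal{N}(\theta, \sigma^2 I_d) + \frac{1}{2}\mathcal{N}(-\theta, \sigma^2 I_d)$. The specific parametrization, the helper name `em_symmetric_mixture`, and the chosen initialization are illustrative assumptions for this sketch, not a verbatim implementation from the paper.

```python
import numpy as np

def em_symmetric_mixture(X, theta0, sigma2_0, n_iters=500):
    """EM iterations for the symmetric two-component location-scale fit
    (1/2) N(theta, sigma^2 I_d) + (1/2) N(-theta, sigma^2 I_d)
    applied to data X of shape (n, d). Returns the final (theta, sigma^2)."""
    n, d = X.shape
    theta, sigma2 = theta0.astype(float).copy(), float(sigma2_0)
    for _ in range(n_iters):
        # E-step: responsibility of the +theta component for each sample;
        # it reduces to a logistic function of 2 <theta, x_i> / sigma^2.
        w = 1.0 / (1.0 + np.exp(-2.0 * (X @ theta) / sigma2))
        # M-step for the location: signed weighted average of the samples.
        theta = ((2.0 * w - 1.0) @ X) / n
        # M-step for the scalar scale shared across all d coordinates:
        # average of the responsibility-weighted squared distances to +/- theta.
        sq = (np.sum(X ** 2, axis=1) + np.sum(theta ** 2)
              - 2.0 * (2.0 * w - 1.0) * (X @ theta))
        sigma2 = np.sum(sq) / (n * d)
    return theta, sigma2

# Over-specified example: the data are standard normal, so the true theta is 0
# and the true scale is 1; EM drifts toward them only slowly.
rng = np.random.default_rng(0)
n, d = 10_000, 2
X = rng.standard_normal((n, d))
theta_hat, sigma2_hat = em_symmetric_mixture(X, theta0=np.ones(d), sigma2_0=2.0)
print(np.linalg.norm(theta_hat), sigma2_hat)
```

Tracking $\|\theta^{(t)}\|$ across iterations in such a simulation is one way to visualize the slow convergence regimes discussed above.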
