A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models

We describe the maximum-likelihood parameter estimation problem and how the Expectation-Maximization (EM) algorithm can be used to solve it. We first describe the abstract form of the EM algorithm as it is often given in the literature. We then develop the EM parameter estimation procedure for two applications: 1) finding the parameters of a mixture of Gaussian densities, and 2) finding the parameters of a hidden Markov model (HMM), i.e., the Baum-Welch algorithm, for both discrete and Gaussian mixture observation models. We derive the update equations in fairly explicit detail but do not prove any convergence properties. We emphasize intuition rather than mathematical rigor.
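As a rough illustration of the first application, the sketch below runs EM on a one-dimensional mixture of two Gaussians using the standard textbook updates: an E-step that computes posterior component responsibilities and an M-step that re-estimates weights, means, and variances from them. The function names (`normal_pdf`, `em_gmm_1d`), the synthetic data, the starting values, and the small variance floor are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, weights, means, variances, n_iter=50):
    """Run EM for a 1-D Gaussian mixture (illustrative helper, not the paper's code).

    Returns the fitted weights, means, variances, and the log-likelihood
    evaluated at the start of each iteration.
    """
    weights, means, variances = list(weights), list(means), list(variances)
    k, n = len(weights), len(data)
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([p / total for p in joint])
        log_liks.append(log_lik)
        # M-step: responsibility-weighted re-estimates of the parameters.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid a collapsed component
    return weights, means, variances, log_liks


# Synthetic data (assumed for the example): two well-separated clusters.
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

fit_weights, fit_means, fit_vars, log_liks = em_gmm_1d(
    data, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0]
)
print("weights:", fit_weights)
print("means:", fit_means)
print("variances:", fit_vars)
```

The recorded log-likelihood values should not decrease from one iteration to the next, which gives a simple sanity check on an implementation even though no convergence proof is attempted here.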
