A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models

We describe the maximum-likelihood parameter estimation problem and how the Expectation-Maximization (EM) algorithm can be used for its solution. We first describe the abstract form of the EM algorithm as it is often given in the literature. We then develop the EM parameter estimation procedure for two applications: 1) finding the parameters of a mixture of Gaussian densities, and 2) finding the parameters of a hidden Markov model (HMM) (i.e., the Baum-Welch algorithm) for both discrete and Gaussian mixture observation models. We derive the update equations in fairly explicit detail, but we do not prove any convergence properties. We try to emphasize intuition rather than mathematical rigor.
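As a rough illustration of the first application named in the abstract, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture. It is not taken from the paper: the function names `gaussian_pdf` and `em_gmm_1d`, the `var_floor` safeguard, and the fixed iteration count are assumptions made for this example, and the updates are the commonly used ones for mixture weights, means, and variances.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, weights, means, variances, n_iter=50, var_floor=1e-6):
    """Run n_iter EM iterations for a 1-D Gaussian mixture (hypothetical helper).

    Returns the updated (weights, means, variances) and the log-likelihood
    measured at the start of each iteration.
    """
    # Work on copies so the caller's initial guesses are left untouched.
    weights, means, variances = list(weights), list(means), list(variances)
    k, n = len(weights), len(data)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component given each point.
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([p / total for p in joint])
        log_likelihoods.append(log_lik)
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            # Assumed safeguard: keep the variance away from zero.
            variances[j] = max(var, var_floor)
    return weights, means, variances, log_likelihoods
```

A typical call passes the observations together with initial guesses, for example `em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])`. The result depends on the initial guesses, since EM is only expected to reach a local maximum of the likelihood.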

Jeff A. Bilmes
