The EM algorithm (Dempster et al., 1977) is one of the most widely used algorithms in statistics. Every year, 200-300 research papers are published in which EM is the topic or the main tool. Applications range from finding new types of stars to separating out different types of tissue in X-ray images to identifying categories of consumers from their buying behaviour. Neural networks and belief networks can be trained using EM as well as more "traditional" gradient-based methods. McLachlan and Krishnan (1997) devote an entire book to EM.

Given some data X and a model family parameterized by θ, the goal of EM in its basic form is to find θ such that the likelihood P(X|θ) is maximized. In general, EM can find only a local maximum. Each cycle revises the value of θ so as to increase the likelihood until a maximum is reached. The purpose of this document is to derive the algorithm in its most general form from first principles and to give a short proof of its convergence. The derivation extends the mixture-model derivation from Bishop (1995, pp. 65-66) and leads to the algorithm given in Mitchell (1997, p. 195).

Suppose we define the log likelihood function L(θ) = ln P(X|θ) and suppose that our current estimate for the optimal parameters is θ_i. We will examine what happens to L when a new value θ is computed by the algorithm:

    L(θ) - L(θ_i) = ln P(X|θ) - ln P(X|θ_i) = ln [ P(X|θ) / P(X|θ_i) ]

Depending on what we choose for θ, the value of L could go up or down. We would like to choose θ to maximize the right-hand side of the equation above. In general, this cannot be done; the core idea of EM is to introduce some unobserved variables Z, appropriate for the model family under consideration, such that if Z were known the optimal θ could be computed easily. Mathematically, Z is brought into the equations by conditioning:
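To make the E/M cycle concrete, here is a minimal sketch (not part of the original note) of EM for a two-component one-dimensional Gaussian mixture, the classic setting where the unobserved variable Z is each point's component label. All function names and the initialization scheme are our own illustrative choices, not taken from the source.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture.

    The latent variable Z is the component label of each point: the
    E-step computes P(Z | X, theta_i) (the "responsibilities"), and the
    M-step chooses the theta that maximizes the expected complete-data
    log likelihood under those responsibilities.  Returns the fitted
    parameters and the trace of L(theta) = ln P(X | theta), which the
    convergence argument says must be non-decreasing.
    """
    n = len(xs)
    mu = [min(xs), max(xs)]          # crude but adequate initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    trace = []
    for _ in range(iters):
        # E-step: r[j][k] = P(z_j = k | x_j, theta_i)
        r = []
        for x in xs:
            w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(w)
            r.append([wk / s for wk in w])
        # M-step: responsibility-weighted re-estimates of mu, var, pi
        for k in range(2):
            nk = sum(r[j][k] for j in range(n))
            mu[k] = sum(r[j][k] * xs[j] for j in range(n)) / nk
            var[k] = max(sum(r[j][k] * (xs[j] - mu[k]) ** 2
                             for j in range(n)) / nk, 1e-6)
            pi[k] = nk / n
        # Record the log likelihood after this cycle
        trace.append(sum(math.log(sum(pi[k] * normal_pdf(x, mu[k], var[k])
                                      for k in range(2))) for x in xs))
    return mu, var, pi, trace

# Usage: data drawn from two well-separated components
random.seed(0)
xs = ([random.gauss(0.0, 1.0) for _ in range(200)] +
      [random.gauss(5.0, 1.0) for _ in range(200)])
mu, var, pi, trace = em_gmm(xs)
```

With this data the recovered means land near 0 and 5, and the log-likelihood trace increases monotonically, illustrating the convergence property proved in the text: each cycle revises θ so as to increase (or leave unchanged) L(θ).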
[1] H. Oh et al. Neural Networks for Pattern Recognition. Adv. Comput., 1993.
[2] C. R. Rao. Linear Statistical Inference and Its Applications. 1965.
[3] C. F. J. Wu. On the Convergence Properties of the EM Algorithm. 1983.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). 1977.
[5] T. G. Dietterich. What is machine learning? Archives of Disease in Childhood, 2020.
[6] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. 1996.
[7] M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. 1993.
[8] L. E. Baum et al. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. 1972.
[9] J. Baker. Trainable grammars for speech recognition. 1979.
[10] P. J. Bickel and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. 1977.