EM Algorithm

The EM algorithm (Dempster et al., 1977) is one of the most widely used algorithms in statistics. Every year, 200-300 research papers are published in which EM is the topic or the main tool. Applications range from finding new types of stars to separating out different types of tissue in X-ray images to identifying categories of consumers from their buying behaviour. Neural networks and belief networks can be trained using EM as well as more "traditional" gradient-based methods. McLachlan and Krishnan (1997) devote an entire book to EM.

Given some data $X$ and a model family parameterized by $\theta$, the goal of EM in its basic form is to find $\theta$ such that the likelihood $P(X\mid\theta)$ is maximized. In general, EM can find only a local maximum. Each cycle revises the value of $\theta$ so as to increase the likelihood until a maximum is reached.

The purpose of this document is to derive the algorithm in its most general form from first principles and to give a short proof of its convergence. The derivation extends the mixture-model derivation from Bishop (1995, pp. 65-66) and leads to the algorithm given in Mitchell (1997, p. 195).

Suppose we define the log likelihood function $L(\theta) = \ln P(X\mid\theta)$ and suppose that our current estimate for the optimal parameters is $\theta_i$. We will examine what happens to $L$ when a new value $\theta$ is computed by the algorithm:
$$
L(\theta) - L(\theta_i) \;=\; \ln P(X\mid\theta) - \ln P(X\mid\theta_i) \;=\; \ln \frac{P(X\mid\theta)}{P(X\mid\theta_i)}
$$
Depending on what we choose for $\theta$, the value of $L$ could go up or down. We would like to choose $\theta$ to maximize the right-hand side of the equation above. In general, this cannot be done; the core idea of EM is to introduce some unobserved variables $Z$, appropriate for the model family under consideration, such that if $Z$ were known the optimal value of $\theta$ could be computed easily. Mathematically, $Z$ is brought into the equations by conditioning:
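Before continuing the derivation, it may help to see the iteration in concrete form. The sketch below, which is not part of this document's general derivation, applies EM to a two-component one-dimensional Gaussian mixture in the spirit of the mixture-model setting cited from Bishop (1995): the unobserved variables $Z$ are the component assignments of the data points, the E-step computes the posterior $P(Z\mid X,\theta_i)$ under the current estimate, and the M-step re-estimates $\theta$ so as to increase the likelihood. The function name em_gaussian_mixture, the two-component setup, and the use of NumPy are illustrative assumptions, not notation from this document.

```python
import numpy as np

def em_gaussian_mixture(x, n_iter=50, seed=0):
    """Minimal EM sketch for a two-component 1-D Gaussian mixture.

    x: 1-D array of observations.
    Returns (weights, means, variances) after n_iter EM cycles.
    """
    rng = np.random.default_rng(seed)
    # Initial guess for theta = (mixing weights, means, variances).
    w = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibilities P(Z | X, theta_i) for the current estimate.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta using the expected assignments,
        # which increases (or leaves unchanged) the likelihood P(X | theta).
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return w, mu, var

# Example usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gaussian_mixture(x))
```

Each pass through the loop is one EM cycle in the sense described above: it revises $\theta$ given the data and the current expectations over $Z$, and (as the convergence argument below establishes in general) the likelihood does not decrease from one cycle to the next.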