Bayesian Learning: Inference and the EM Algorithm

This is the first of two chapters dedicated to Bayesian learning. The main concepts and philosophy behind Bayesian inference are introduced. The evidence function and its relation to Occam's razor are presented. The expectation-maximization (EM) algorithm is derived and applied to linear regression and Gaussian mixture modeling. The k-means clustering algorithm and its close relation to Gaussian mixture modeling are discussed. Finally, the concept of probabilistic model mixing is reviewed and the mixture-of-experts model is introduced.
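As a pointer to how the evidence function encodes Occam's razor, it helps to recall its standard definition (the notation below is a common convention, assumed here rather than quoted from the chapter): for a model M with parameters \boldsymbol{\theta}, the evidence is

    p(\mathcal{X} \mid M) = \int p(\mathcal{X} \mid \boldsymbol{\theta}, M)\, p(\boldsymbol{\theta} \mid M)\, \mathrm{d}\boldsymbol{\theta}.

A very flexible model must spread its prior predictive probability over many possible data sets, so the evidence it assigns to any one observed data set is small; ranking models by their evidence therefore penalizes superfluous complexity automatically.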
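To make the EM derivation concrete, the standard decomposition of the log-likelihood of the observed data \mathbf{X} in terms of an arbitrary distribution q(\mathbf{Z}) over the latent variables \mathbf{Z} is a useful reference point (again, the notation is an assumed convention, not a quotation from the chapter):

    \ln p(\mathbf{X}; \boldsymbol{\theta})
      = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z}; \boldsymbol{\theta})}{q(\mathbf{Z})}}_{\mathcal{L}(q, \boldsymbol{\theta})}
      + \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{q(\mathbf{Z})}{p(\mathbf{Z} \mid \mathbf{X}; \boldsymbol{\theta})}}_{\mathrm{KL}\big(q \,\|\, p(\mathbf{Z} \mid \mathbf{X}; \boldsymbol{\theta})\big)}.

The E-step sets q equal to the current posterior, which drives the KL term to zero, and the M-step maximizes the lower bound \mathcal{L}(q, \boldsymbol{\theta}) over \boldsymbol{\theta}; because the KL divergence is nonnegative, no iteration can decrease the log-likelihood.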
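The following is a minimal sketch of EM fitting a two-component, one-dimensional Gaussian mixture in Python; the synthetic data, variable names, and fixed iteration count are illustrative assumptions, not code from the chapter.

    # Sketch: EM for a 1-D mixture of two Gaussians (illustrative, not from the chapter).
    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic data drawn from two Gaussian clusters.
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

    # Initial guesses: mixing weights, means, variances.
    w = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])

    for _ in range(100):  # fixed budget; a real implementation would test convergence
        # E-step: responsibility of each component for each point (N x 2 array).
        dens = (w / np.sqrt(2.0 * np.pi * var)) * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the soft assignments.
        nk = resp.sum(axis=0)
        w = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

    print("weights:", w, "means:", mu, "variances:", var)

If the soft responsibilities are replaced by hard 0/1 assignments to the nearest mean (the limit of equal, vanishing variances), the updates collapse to the k-means algorithm, which is the close relation between the two methods noted above.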
