Graphical models: parameter learning

“Graphical models” combine graph theory and probability theory to provide a general framework for representing models in which a number of variables interact. Graphical models trace their origins to many different fields and have been applied in a wide variety of settings: for example, to develop probabilistic expert systems, to understand neural network models, to infer trait inheritance in genealogies, to model images, to correct errors in digital communication, or to solve complex decision problems. Remarkably, the same formalisms and algorithms can be applied to this wide range of problems.

Each node in the graph represents a random variable (or, more generally, a set of random variables). The pattern of edges in the graph represents the qualitative dependencies between the variables; the absence of an edge between two nodes means that any statistical dependency between these two variables is mediated via some other variable or set of variables. The quantitative dependencies between variables that are connected by edges are specified via parameterized conditional distributions, or more generally via non-negative “potential functions”. The pattern of edges and the potential functions together specify a joint probability distribution over all the variables in the graph. We refer to the pattern of edges as the structure of the graph, and to the parameters of the potential functions simply as the parameters of the graph.

In this chapter, we assume that the structure of the graph is given, and that our goal is to learn the parameters of the graph from data. Solutions to the problem of learning the graph structure from data are given in GRAPHICAL MODELS, STRUCTURE LEARNING. We briefly review some of the notation from PROBABILISTIC INFERENCE IN GRAPHICAL MODELS which we will need to cover parameter learning in graphical models; we assume that the reader is familiar with the contents of that chapter.
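To make the distinction between structure and parameters concrete, the following minimal sketch (not part of the original article) estimates the parameters of a tiny discrete directed graphical model by maximum likelihood from fully observed data. The structure A -> B and the toy data set are assumptions chosen purely for illustration; the parameters are the entries of the conditional probability tables, obtained here by normalizing counts.

```python
# Minimal sketch: maximum-likelihood parameter learning for a fixed
# two-node structure A -> B with fully observed discrete data.
from collections import Counter

# Hypothetical complete data: each record assigns a value to every node.
data = [
    {"A": 0, "B": 1},
    {"A": 0, "B": 0},
    {"A": 1, "B": 1},
    {"A": 1, "B": 1},
    {"A": 0, "B": 1},
]

# P(A): marginal counts of A, normalized by the number of records.
a_counts = Counter(r["A"] for r in data)
p_a = {a: n / len(data) for a, n in a_counts.items()}

# P(B | A): for each observed value of A, normalize the joint counts of (A, B).
ab_counts = Counter((r["A"], r["B"]) for r in data)
b_values = {r["B"] for r in data}
p_b_given_a = {
    a: {b: ab_counts[(a, b)] / a_counts[a] for b in b_values}
    for a in a_counts
}

print(p_a)          # e.g. {0: 0.6, 1: 0.4}
print(p_b_given_a)  # e.g. {0: {0: 0.33.., 1: 0.66..}, 1: {0: 0.0, 1: 1.0}}
```

With complete data the likelihood factorizes over the nodes, so each conditional table can be estimated independently from counts, as above; the harder settings treated later in the chapter (hidden variables, undirected potentials) require iterative procedures such as EM or iterative scaling.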
