Learning Hidden Variable Networks: The Information Bottleneck Approach

A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm, which, however, can easily get trapped in sub-optimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but it usually performs poorly without prior knowledge about the cardinality and location of those variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving a formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of an easy, smooth target function to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of both the parameters and the structure of a network. In addition, it allows us to introduce new hidden variables during model selection and to learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets.
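To fix notation for the objectives involved (standard forms from the cited literature; the paper's exact functionals, and which parameter is annealed by the continuation procedure, are assumptions in this sketch rather than the paper's own statement), the information bottleneck of Tishby et al. (1999) compresses an observed variable X into a representation T that preserves information about a relevance variable Y by minimizing the Lagrangian

\mathcal{L}_{\mathrm{IB}} \;=\; I(T;X) \;-\; \beta\, I(T;Y),

whereas the EM functional, in the free-energy view of Neal (1993), is maximized over an auxiliary distribution q and the parameters \theta,

\mathcal{F}(q,\theta) \;=\; \mathbb{E}_{q(H)}\!\left[\log P(X,H \mid \theta)\right] \;+\; H(q),

and equals the log-likelihood \log P(X \mid \theta) whenever q is the exact posterior over the hidden variables H. The correspondence proved in the paper relates objectives of these two forms, and the continuation procedure tracks the solution as the information-theoretic trade-off is gradually shifted from a smooth, easily optimized regime toward the pure likelihood objective.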

[1] C. Reeves. Modern heuristic techniques for combinatorial problems, 1993.

[2] Naftali Tishby et al. Data Clustering by Markovian Relaxation and the Information Bottleneck Method, 2000, NIPS.

[3] C. D. Gelatt et al. Optimization by Simulated Annealing, 1983, Science.

[4] Nevin Lianwen Zhang. Hierarchical latent class models for cluster analysis, 2002, J. Mach. Learn. Res.

[5] Dale Schuurmans et al. Data perturbation for escaping local maxima in learning, 2002, AAAI/IAAI.

[6] James Kelly et al. AutoClass: A Bayesian Classification System, 1993, ML.

[7] Wai Lam et al. Learning Bayesian Belief Networks: An Approach Based on the MDL Principle, 1994, Comput. Intell.

[8] Naftali Tishby et al. Agglomerative Information Bottleneck, 1999, NIPS.

[9] Noam Slonim et al. Maximum Likelihood and the Information Bottleneck, 2002, NIPS.

[10] Naonori Ueda et al. Deterministic annealing EM algorithm, 1998, Neural Networks.

[11] Richard E. Neapolitan et al. Learning Bayesian networks, 2007, KDD '07.

[12] Bo Thiesson et al. Learning Mixtures of Bayesian Networks, 1997, UAI.

[13] Adrian Corduneanu et al. Continuation Methods for Mixing Heterogeneous Sources, 2002, UAI.

[14] Heekuck Oh et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.

[15] Amos Storkey et al. Advances in Neural Information Processing Systems 20, 2007.

[16] David Maxwell Chickering et al. Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables, 1997, Machine Learning.

[17] D. Botstein et al. Genomic expression programs in the response of yeast cells to environmental changes, 2000, Molecular Biology of the Cell.

[18] Andreas Stolcke et al. Hidden Markov Model Induction by Bayesian Model Merging, 1992, NIPS.

[19] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, 1998, Proc. IEEE.

[20] Michael I. Jordan et al. Estimating Dependency Structure as a Hidden Variable, 1997, NIPS.

[21] M. DeGroot. Optimal Statistical Decisions, 1970.

[22] Tommi S. Jaakkola et al. Information Regularization with Partially Labeled Data, 2002, NIPS.

[23] David Maxwell Chickering et al. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, 1994, Machine Learning.

[24] S. Lauritzen. The EM algorithm for graphical association models with missing data, 1995.

[25] Nir Friedman et al. Discovering Hidden Variables: A Structure-Based Approach, 2000, NIPS.

[26] P. Spirtes et al. Causation, prediction, and search, 1993.

[27] Bo Thiesson et al. Learning Mixtures of DAG Models, 1998, UAI.

[28] Naftali Tishby et al. Agglomerative Multivariate Information Bottleneck, 2001, NIPS.

[29] Naftali Tishby et al. Multivariate Information Bottleneck, 2001, Neural Computation.

[30] David Maxwell Chickering et al. Learning Equivalence Classes of Bayesian Network Structures, 1996, UAI.

[31] Noah A. Smith et al. Annealing Techniques for Unsupervised Statistical Language Learning, 2004, ACL.

[32] Kuo-Chu Chang et al. Refinement and coarsening of Bayesian networks, 1990, UAI.

[33] Nir Friedman et al. Incorporating Expressive Graphical Models in Variational Approximations: Chain-graphs and Hidden Variables, 2001, UAI.

[34] Nir Friedman et al. Learning Belief Networks in the Presence of Missing Values and Hidden Variables, 1997, ICML.

[35] Naftali Tishby et al. The information bottleneck method, 2000, ArXiv.

[36] J. Adachi et al. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood, 1996.

[37] D. Rubin et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[38] Radford M. Neal. A new view of the EM algorithm that justifies incremental and other variants, 1993.

[39] D. M. Titterington et al. Applying the deterministic annealing expectation maximization algorithm to Naive Bayes networks, 2002.

[40] E. Jaynes. Information Theory and Statistical Mechanics, 1957.

[41] Judea Pearl. Probabilistic reasoning in intelligent systems, 1988.

[42] Naftali Tishby et al. Distributional Clustering of English Words, 1993, ACL.

[43] Bo Thiesson et al. Score and Information for Recursive Exponential Models with Incomplete Data, 1997, UAI.

[44] Thomas M. Cover et al. Elements of Information Theory, 2005.

[45] Layne T. Watson et al. Theory of Globally Convergent Probability-One Homotopies for Nonlinear Programming, 2000, SIAM J. Optim.

[46] Xavier Boyen et al. Discovering the Hidden Structure of Complex Dynamic Systems, 1999, UAI.

[47] Nir Friedman et al. Learning the Dimensionality of Hidden Variables, 2001, UAI.

[48] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference, 1991, Morgan Kaufmann series in representation and reasoning.