Learning and Inference in Latent Variable Graphical Models

Author(s): Ping, Wei | Advisor(s): Ihler, Alexander | Abstract: Probabilistic graphical models such as Markov random fields provide a powerful framework and tools for machine learning, especially for structured output learning. Latent variables arise naturally in many applications of these models; they may come from partially labeled data, or be introduced to enrich model flexibility. However, the presence of latent variables poses challenges for both learning and inference. For example, the standard approach of maximum a posteriori (MAP) prediction is complicated by the fact that, in latent variable models (LVMs), we typically want to marginalize out the latent variables first, leading to an inference task called marginal MAP. Unfortunately, marginal MAP prediction can be NP-hard even on relatively simple models such as trees, and few methods for it have been developed in the literature. This thesis presents a class of variational bounds for marginal MAP that generalizes the popular dual-decomposition method for MAP inference, and enables an efficient block coordinate descent algorithm to solve the corresponding optimization. Similarly, when learning LVMs for structured prediction, it is critically important to preserve the effect of uncertainty over latent variables by marginalization. We propose the marginal structured SVM, which uses marginal MAP inference to properly handle that uncertainty within the framework of max-margin learning. We then turn our attention to an important subclass of latent variable models, restricted Boltzmann machines (RBMs). RBMs are two-layer latent variable models that are widely used to capture complex distributions of observed data, including as building blocks for deep probabilistic models. One practical problem with RBMs is model selection: the size of the hidden (latent) layer must be fixed before learning. We propose an infinite RBM model and apply the Frank-Wolfe algorithm to solve the resulting learning problem.
The resulting algorithm can be interpreted as inserting one hidden variable into the RBM at each iteration, so that model selection is performed easily and efficiently during learning. We also study the role of approximate inference in RBMs and conditional RBMs. In particular, it is commonly assumed that belief propagation methods do not work well on RBM-based models, especially for learning. In contrast, we demonstrate that for conditional models and structured prediction, training RBM-based models with belief propagation and its variants can give much better results than the state-of-the-art contrastive divergence methods.
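To make the distinction between marginal MAP and joint MAP concrete, here is a minimal sketch on a tiny hypothetical model with output variable x and latent variable z (the probability table is made up for illustration, not taken from the thesis). It shows that the x in the single most likely joint configuration can differ from the x that is most likely once z is marginalized out:

```python
# Toy joint distribution p(x, z): x is the output variable, z the latent
# variable. The numbers are hypothetical, chosen so the two predictions differ.
p = {
    (0, 0): 0.20, (0, 1): 0.20, (0, 2): 0.20,
    (1, 0): 0.25, (1, 1): 0.10, (1, 2): 0.05,
}

# Joint MAP: maximize over (x, z) together.
joint_map_x, joint_map_z = max(p, key=p.get)

# Marginal MAP: marginalize z out first, then maximize over x alone.
marginal = {x: sum(p[(x, z)] for z in range(3)) for x in (0, 1)}
marginal_map_x = max(marginal, key=marginal.get)

print(joint_map_x)     # 1: the single most likely configuration has x = 1
print(marginal_map_x)  # 0: but x = 0 carries more total probability mass
```

On models with many latent variables the sum over z becomes intractable, which is why the variational bounds and block coordinate descent algorithm described above are needed.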

[1]  Alexander T. Ihler,et al.  Linear Approximation to ADMM for MAP inference , 2013, ACML.

[2]  Yee Whye Teh,et al.  Approximate inference in Boltzmann machines , 2003, Artif. Intell..

[3]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[4]  Ben Taskar,et al.  Introduction to statistical relational learning , 2007 .

[5]  Tsuhan Chen,et al.  Efficient inference for fully-connected CRFs with stationarity , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Tommi S. Jaakkola,et al.  Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations , 2007, NIPS.

[7]  Claire Cardie,et al.  Multi-Level Structured Models for Document-Level Sentiment Classification , 2010, EMNLP.

[8]  Lars Otten,et al.  Join-graph based cost-shifting schemes , 2012, UAI.

[9]  Amir Globerson,et al.  Convergent message passing algorithms - a unifying view , 2009, UAI.

[10]  Michael I. Jordan,et al.  Loopy Belief Propagation for Approximate Inference: An Empirical Study , 1999, UAI.

[11]  A. Darwiche,et al.  Complexity Results and Approximation Strategies for MAP Explanations , 2011, J. Artif. Intell. Res..

[12]  Mark W. Schmidt,et al.  Block-Coordinate Frank-Wolfe Optimization for Structural SVMs , 2012, ICML.

[13]  Rina Dechter,et al.  Mini-buckets: A general scheme for bounded inference , 2003, JACM.

[14]  Tamir Hazan,et al.  Tightening Fractional Covering Upper Bounds on the Partition Function for High-Order Region Graphs , 2012, UAI.

[15]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[16]  Wei Ping,et al.  Marginal Structured SVM with Hidden Variables , 2014, ICML.

[17]  Thorsten Joachims,et al.  Learning structural SVMs with latent variables , 2009, ICML '09.

[18]  Hugo Larochelle,et al.  Loss-sensitive Training of Probabilistic Conditional Random Fields , 2011, ArXiv.

[19]  William T. Freeman,et al.  Removing camera shake from a single photograph , 2006, SIGGRAPH 2006.

[20]  Razvan Pascanu,et al.  Autotagging music with conditional restricted Boltzmann machines , 2011, ArXiv.

[21]  Geoffrey E. Hinton,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[22]  Nikos Komodakis,et al.  MRF Energy Minimization and Beyond via Dual Decomposition , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Wei Ping,et al.  Decomposition Bounds for Marginal MAP , 2015, NIPS.

[24]  Junjie Wu,et al.  Multinomial Latent Logistic Regression for Image Understanding , 2016, IEEE Transactions on Image Processing.

[25]  Qiang Liu,et al.  Bounding the Partition Function using Holder's Inequality , 2011, ICML.

[26]  Adnan Darwiche,et al.  Solving MAP Exactly using Systematic Search , 2002, UAI.

[27]  Martin J. Wainwright,et al.  Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting , 2006, J. Mach. Learn. Res..

[28]  Qiang Liu,et al.  Variational algorithms for marginal MAP , 2011, J. Mach. Learn. Res..

[29]  David M. Bradley,et al.  Convex Coding , 2009, UAI.

[30]  W. Freeman,et al.  Generalized Belief Propagation , 2000, NIPS.

[31]  Paul Smolensky,et al.  Information processing in dynamical systems: foundations of harmony theory , 1986 .

[32]  Nando de Freitas,et al.  Inductive Principles for Restricted Boltzmann Machine Learning , 2010, AISTATS.

[33]  Tommi S. Jaakkola,et al.  Tree Block Coordinate Descent for MAP in Graphical Models , 2009, AISTATS.

[34]  Geoffrey E. Hinton,et al.  Self Supervised Boosting , 2002, NIPS.

[35]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[36]  Yair Weiss,et al.  MAP Estimation, Linear Programming and Belief Propagation with Convex Free Energies , 2007, UAI.

[37]  Christopher A. Meek,et al.  Approximating Max‐Sum‐Product Problems using Multiplicative Error Bounds , 2011 .

[38]  Martin J. Wainwright,et al.  Tree-based reparameterization framework for analysis of sum-product and related algorithms , 2003, IEEE Trans. Inf. Theory.

[39]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[40]  Kazuyuki Tanaka,et al.  Approximate Learning Algorithm in Boltzmann Machines , 2009, Neural Computation.

[41]  Ye Xu,et al.  Hyperlink Prediction in Hypernetworks Using Latent Social Features , 2013, Discovery Science.

[42]  Simon J. Godsill,et al.  Marginal maximum a posteriori estimation using Markov chain Monte Carlo , 2002, Stat. Comput..

[43]  Philip Wolfe,et al.  An algorithm for quadratic programming , 1956 .

[44]  Tamir Hazan,et al.  Norm-Product Belief Propagation: Primal-Dual Message-Passing for Approximate Inference , 2009, IEEE Transactions on Information Theory.

[45]  Rina Dechter,et al.  From Exact to Anytime Solutions for Marginal MAP , 2016, AAAI.

[46]  Denis Deratani Mauá,et al.  Anytime marginal maximum a posteriori inference , 2012, ICML 2012.

[47]  Patrice Marcotte,et al.  Some comments on Wolfe's ‘away step’ , 1986, Math. Program..

[48]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[49]  Ming-Hsuan Yang,et al.  Max-Margin Boltzmann Machines for Object Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Nikos Komodakis,et al.  MRF Optimization via Dual Decomposition: Message-Passing Revisited , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[51]  Bill Triggs,et al.  Scene Segmentation with CRFs Learned from Partially Labeled Images , 2007, NIPS.

[52]  Qiang Liu,et al.  Reasoning and Decisions in Probabilistic Graphical Models - A Unified Framework , 2014 .

[53]  Sekhar Tatikonda,et al.  Message-Passing Algorithms: Reparameterizations and Splittings , 2010, IEEE Transactions on Information Theory.

[54]  William T. Freeman,et al.  Latent hierarchical structural learning for object detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[55]  Julian Yarkony,et al.  Covering trees and lower-bounds on quadratic assignment , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[56]  Bart Selman,et al.  Solving Marginal MAP Problems with NP Oracles and Parity Constraints , 2016, NIPS.

[57]  Joris M. Mooij,et al.  libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models , 2010, J. Mach. Learn. Res..

[58]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[59]  Geoffrey E. Hinton,et al.  Conditional Restricted Boltzmann Machines for Structured Output Prediction , 2011, UAI.

[60]  Tomás Werner,et al.  A Linear Programming Approach to Max-Sum Problem: A Review , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Daphne Koller,et al.  Modeling Latent Variable Uncertainty for Loss-based Learning , 2012, ICML.

[62]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[63]  Wei Ping,et al.  Belief Propagation in Conditional RBMs for Structured Prediction , 2017, AISTATS.

[64]  Changhe Yuan,et al.  Annealed MAP , 2004, UAI.

[65]  Hilbert J. Kappen,et al.  On the properties of the Bethe approximation and loopy belief propagation on binary networks , 2004 .

[66]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[67]  Martin J. Wainwright,et al.  A new class of upper bounds on the log partition function , 2002, IEEE Transactions on Information Theory.

[68]  Ruslan Salakhutdinov,et al.  On the quantitative analysis of deep belief networks , 2008, ICML '08.

[69]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[70]  David A. Smith,et al.  Improving NLP through Marginalization of Hidden Syntactic Structure , 2012, EMNLP-CoNLL.

[71]  Dimitri P. Bertsekas,et al.  Nonlinear Programming , 1997 .

[72]  Tamir Hazan,et al.  A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction , 2010, NIPS.

[73]  Geoffrey E. Hinton,et al.  Restricted Boltzmann machines for collaborative filtering , 2007, ICML '07.

[74]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[75]  Brendan J. Frey,et al.  Factor graphs and the sum-product algorithm , 2001, IEEE Trans. Inf. Theory.

[76]  Frédo Durand,et al.  Efficient marginal likelihood optimization in blind deconvolution , 2011, CVPR 2011.

[77]  Geoffrey E. Hinton,et al.  Modeling Human Motion Using Binary Latent Variables , 2006, NIPS.

[78]  Rina Dechter Reasoning with Probabilistic and Deterministic Graphical Models: Exact Algorithms , 2013, Reasoning with Probabilistic and Deterministic Graphical Models: Exact Algorithms.

[79]  Yasubumi Sakakibara,et al.  RNA secondary structural alignment with conditional random fields , 2005, ECCB/JBI.

[80]  Changhe Yuan,et al.  Efficient Computation of Jointree Bounds for Systematic MAP Search , 2009, IJCAI.

[81]  Xinhua Zhang,et al.  Convex Two-Layer Modeling , 2013, NIPS.

[82]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[83]  Xin Li,et al.  Conditional Restricted Boltzmann Machines for Multi-label Learning with Incomplete Labels , 2015, AISTATS.

[84]  Rina Dechter,et al.  AND/OR Search for Marginal MAP , 2014, UAI.

[85]  Tommi S. Jaakkola,et al.  Tightening LP Relaxations for MAP using Message Passing , 2008, UAI.

[86]  Ian McGraw,et al.  Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing , 2006, UAI.

[87]  Ben J. A. Kröse,et al.  Efficient Greedy Learning of Gaussian Mixture Models , 2003, Neural Computation.

[88]  Rina Dechter,et al.  Anytime Anyspace AND/OR Search for Bounding the Partition Function , 2017, AAAI.

[89]  Tamir Hazan,et al.  Convergent Message-Passing Algorithms for Inference over General Graphs with Convex Free Energies , 2008, UAI.

[90]  Tommi S. Jaakkola,et al.  Approximate inference using conditional entropy decompositions , 2007, AISTATS.

[91]  Yang Wang,et al.  Max-margin hidden conditional random fields for human action recognition , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[92]  S. Sathiya Keerthi,et al.  Deterministic Annealing for Semi-Supervised Structured Output Learning , 2012, AISTATS.

[93]  Geoffrey E. Hinton A Practical Guide to Training Restricted Boltzmann Machines , 2012, Neural Networks: Tricks of the Trade.

[94]  Rahul G. Krishnan,et al.  Barrier Frank-Wolfe for Marginal Inference , 2015, NIPS.

[95]  Nicolas Le Roux,et al.  Convex Neural Networks , 2005, NIPS.

[96]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[97]  Eric T. Nalisnick,et al.  Under review as a conference paper at ICLR 2016 , 2015 .

[98]  Ilya Sutskever,et al.  On the Convergence Properties of Contrastive Divergence , 2010, AISTATS.

[99]  Xiaolong Wang,et al.  Protein-protein interaction site prediction based on conditional random fields , 2007, Bioinform..

[100]  Francis R. Bach,et al.  Breaking the Curse of Dimensionality with Convex Neural Networks , 2014, J. Mach. Learn. Res..

[101]  Nir Friedman,et al.  Probabilistic Graphical Models - Principles and Techniques , 2009 .

[102]  Nikos A. Vlassis,et al.  The global k-means clustering algorithm , 2003, Pattern Recognit..

[103]  William T. Freeman,et al.  Constructing free-energy approximations and generalized belief propagation algorithms , 2005, IEEE Transactions on Information Theory.

[104]  D. Sontag 1 Introduction to Dual Decomposition for Inference , 2010 .

[105]  Rahul Gupta,et al.  Accurate max-margin training for structured output spaces , 2008, ICML '08.

[106]  Tijmen Tieleman,et al.  Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[107]  Haipeng Luo,et al.  Online Gradient Boosting , 2015, NIPS.

[108]  Kevin Miller,et al.  Max-Margin Min-Entropy Models , 2012, AISTATS.

[109]  Marc Pollefeys,et al.  Efficient Structured Prediction with Latent Variables for General Graphical Models , 2012, ICML.

[110]  Yee Whye Teh,et al.  Bayesian Nonparametric Models , 2010, Encyclopedia of Machine Learning.

[111]  Nikos Komodakis,et al.  Beyond Loose LP-Relaxations: Optimizing MRFs by Repairing Cycles , 2008, ECCV.

[112]  Tommi S. Jaakkola,et al.  Learning Efficiently with Approximate Inference via Dual Losses , 2010, ICML.

[113]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[114]  Alan L. Yuille,et al.  The Concave-Convex Procedure , 2003, Neural Computation.

[115]  Nathan Ratliff,et al.  Online) Subgradient Methods for Structured Prediction , 2007 .

[116]  Trevor Darrell,et al.  Hidden Conditional Random Fields for Gesture Recognition , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[117]  Geoffrey E. Hinton,et al.  Factored 3-Way Restricted Boltzmann Machines For Modeling Natural Images , 2010, AISTATS.

[118]  Andrew McCallum,et al.  Introduction to Statistical Relational Learning , 2007 .

[119]  E. Jaynes Information Theory and Statistical Mechanics , 1957 .

[120]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[121]  Sebastian Nowozin,et al.  Structured Prediction and Learning in Computer Vision , 2011 .

[122]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[123]  Joseph Gonzalez,et al.  Residual Splash for Optimally Parallelizing Belief Propagation , 2009, AISTATS.

[124]  Justin Domke Dual Decomposition for Marginal Inference , 2011, AAAI.

[125]  Martin Jaggi,et al.  Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization , 2013, ICML.