Bayesian inference with posterior regularization and applications to infinite latent SVMs

Existing Bayesian models, especially nonparametric Bayesian methods, rely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations. While priors can affect posterior distributions through Bayes' rule, imposing posterior regularization is arguably more direct and in some cases more natural and general. In this paper, we present regularized Bayesian inference (RegBayes), a novel computational framework that performs posterior inference with a regularization term on the desired post-data posterior distribution under an information-theoretic formulation. RegBayes is more flexible than eliciting expert knowledge via priors, and it covers both directed Bayesian networks and undirected Markov networks, whose Bayesian formulation results in hybrid chain graph models. When the regularization is induced from a linear operator on the posterior distributions, such as the expectation operator, we present a general convex-analysis theorem to characterize the solution of RegBayes. Furthermore, we present two concrete examples of RegBayes, infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which combine the large-margin idea with a nonparametric Bayesian model to discover predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets, which appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics. Such results were not previously available, and they help advance the interface between these two important subfields, which the community has largely treated as isolated.
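
As a rough illustration of the information-theoretic formulation mentioned above, the following is a sketch under assumed notation, not an excerpt from the paper. It uses the standard fact that Bayes' rule is the solution of a variational problem over distributions, and then adds a generic posterior regularizer to that objective. The symbols $\pi$ (prior), $p(\mathcal{D} \mid M)$ (likelihood), $\Omega$ (regularizer), $c$ (regularization weight), and $\psi$ (feature function) are assumptions introduced here for illustration.

```latex
% Bayes' rule as an optimization problem over the space of
% probability distributions (a standard variational identity):
\min_{q(M)} \; \mathrm{KL}\bigl(q(M) \,\|\, \pi(M)\bigr)
  - \int q(M)\, \log p(\mathcal{D} \mid M)\, \mathrm{d}M
\quad \text{s.t.} \quad q \in \mathcal{P}_{\mathrm{prob}},
% whose unique minimizer is the usual posterior
% q(M) \propto \pi(M)\, p(\mathcal{D} \mid M).

% Assumed form of a regularized variant: the same objective plus a
% penalty on the post-data posterior q, weighted by c \ge 0.
\min_{q(M)} \; \mathrm{KL}\bigl(q(M) \,\|\, \pi(M)\bigr)
  - \int q(M)\, \log p(\mathcal{D} \mid M)\, \mathrm{d}M
  + c\, \Omega\bigl(q(M)\bigr)
\quad \text{s.t.} \quad q \in \mathcal{P}_{\mathrm{prob}}.

% Example of a regularizer induced by a linear operator on q:
% \Omega depends on q only through an expectation,
% \Omega(q) = g\bigl(\mathbb{E}_{q}[\psi(M)]\bigr) with g convex.
```

With $c = 0$ the second problem reduces to the first, so ordinary Bayesian inference is recovered as a special case; a large-margin penalty on expected predictions is one example of a regularizer of the expectation-based kind.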
