Partitioned Variational Inference: A unified framework encompassing federated and continual learning

Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. However, practitioners are faced with a fragmented literature that offers a bewildering array of algorithmic options. First, the choice of variational family. Second, the granularity of the updates, e.g. whether they are local to each data point (employing message passing) or global. Third, the method of optimization (bespoke or black-box, closed-form or stochastic updates, etc.). This paper presents a new framework, termed Partitioned Variational Inference (PVI), that explicitly acknowledges these algorithmic dimensions of VI, unifies disparate literature, and provides guidance on usage. Crucially, the proposed PVI framework allows us to identify new ways of performing VI that are ideally suited to challenging learning scenarios, including federated learning (where distributed computing is leveraged to process non-centralized data) and continual learning (where new data and tasks arrive over time and must be accommodated quickly). We showcase these new capabilities by developing communication-efficient federated training of Bayesian neural networks and continual learning for Gaussian process models with private pseudo-points. The new methods significantly outperform the state of the art, whilst being almost as straightforward to implement as standard VI.
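To make the "granularity of updates" dimension concrete, the sketch below gives one hedged reading of a partitioned update of the kind the abstract alludes to: the approximate posterior is written as the prior times per-partition approximate likelihood factors, and a step on partition m refines only that partition's factor via a local free energy. The notation (factors t_m, partition data y_m, variational family Q) is ours and not quoted from the paper; it is a minimal reconstruction, not the definitive algorithm.

\begin{align*}
q(\theta) &\propto p(\theta) \prod_{m=1}^{M} t_m(\theta)
  && \text{(approximate posterior over } M \text{ data partitions)} \\
q^{\mathrm{new}}(\theta) &= \operatorname*{arg\,max}_{q \in \mathcal{Q}}
  \int q(\theta)\, \log \frac{p(\mathbf{y}_m \mid \theta)\, q^{\mathrm{old}}(\theta) / t_m^{\mathrm{old}}(\theta)}{q(\theta)}\, \mathrm{d}\theta
  && \text{(local free-energy step for partition } m\text{)} \\
t_m^{\mathrm{new}}(\theta) &\propto \frac{q^{\mathrm{new}}(\theta)}{q^{\mathrm{old}}(\theta)}\, t_m^{\mathrm{old}}(\theta)
  && \text{(update only partition } m\text{'s factor)}
\end{align*}

Under this reading, choosing one partition per data point recovers message-passing-style local updates, while a single partition containing all the data recovers global (batch) VI.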
