Reconciling meta-learning and continual learning with online mixtures of tasks

Learning-to-learn or meta-learning leverages data-driven inductive bias to increase the efficiency of learning on a novel task. This approach encounters difficulty when transfer is not advantageous, for instance, when tasks are considerably dissimilar or change over time. We use the connection between gradient-based meta-learning and hierarchical Bayes to propose a Dirichlet process mixture of hierarchical Bayesian models over the parameters of an arbitrary parametric model such as a neural network. In contrast to consolidating inductive biases into a single set of hyperparameters, our approach of task-dependent hyperparameter selection better handles latent distribution shift, as demonstrated on a set of evolving, image-based, few-shot learning benchmarks.
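To make the core idea concrete, here is a minimal sketch (not the authors' released code) of task-dependent meta-parameter selection under a Chinese-restaurant-process prior over mixture components. It assumes linear regression tasks with squared loss, hard MAP assignment of each task to a component, and a Reptile-style interpolation standing in for the full hierarchical-Bayes meta-update; all function names, step sizes, and the spawn-a-new-component rule are illustrative choices, not the paper's exact algorithm.

```python
# Sketch: online mixture of meta-learned initializations.
# Each mixture component holds an initialization theta_k; a new task is
# adapted from every component, assigned to the component whose adapted
# parameters best explain the held-out (query) set under a CRP prior,
# and the winning component's initialization is then updated.
import numpy as np

rng = np.random.default_rng(0)

def inner_adapt(theta, X, y, lr=0.1, steps=5):
    """A few gradient steps on the support set (task-specific adaptation)."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

def query_loss(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

def assign_and_update(components, counts, support, query,
                      alpha=1.0, meta_lr=0.5, sigma2=1.0):
    """Score every component (and a candidate new one) on the query set,
    take the MAP assignment, and meta-update the chosen initialization."""
    Xs, ys = support
    Xq, yq = query
    scores, adapted = [], []
    for theta, n in zip(components, counts):
        phi = inner_adapt(theta, Xs, ys)
        adapted.append(phi)
        # log CRP prior (proportional to count) + Gaussian query log-likelihood
        scores.append(np.log(n) - query_loss(phi, Xq, yq) / (2 * sigma2))
    # candidate new component, here initialized at zero for simplicity
    theta_new = np.zeros_like(components[0])
    phi_new = inner_adapt(theta_new, Xs, ys)
    scores.append(np.log(alpha) - query_loss(phi_new, Xq, yq) / (2 * sigma2))
    k = int(np.argmax(scores))          # hard (MAP) assignment
    if k == len(components):            # spawn a new mixture component
        components.append(theta_new.copy())
        counts.append(0)
        adapted.append(phi_new)
    counts[k] += 1
    # Reptile-style interpolation toward the adapted parameters, used here
    # as a simple stand-in for the hierarchical-Bayes hyperparameter update
    components[k] += meta_lr * (adapted[k] - components[k])
    return k

# Usage: a stream of tasks alternating between two latent clusters.
components, counts = [np.zeros(2)], [1]
for t in range(200):
    w = np.array([2.0, -1.0]) if t % 2 == 0 else np.array([-3.0, 0.5])
    X = rng.normal(size=(20, 2))
    y = X @ w + 0.1 * rng.normal(size=20)
    assign_and_update(components, counts, (X[:10], y[:10]), (X[10:], y[10:]))
```

Because assignment is made per task, a shift in the latent task distribution simply routes new tasks to a different (possibly freshly spawned) component rather than overwriting a single shared set of hyperparameters, which is the contrast the abstract draws with consolidated inductive biases.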
