Information-Theoretic Generalization Bounds for Meta-Learning and Applications

Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks, with the goal of improving sample efficiency on new, previously unobserved tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and the average loss on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered: those that use separate within-task training and test sets, like model-agnostic meta-learning (MAML), and those that use joint within-task training and test sets, like Reptile. Extending existing work on conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter class, the derived bound includes an additional MI term between the output of the per-task learning procedure and the corresponding data set, capturing within-task uncertainty. Tighter bounds are then developed for both classes via novel individual-task MI (ITMI) bounds. Finally, applications of the derived bounds are discussed, including to a broad class of noisy iterative meta-learning algorithms.
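To make the structure of these results concrete, the display below restates the standard single-task mutual information bound of Xu and Raginsky, which the paper extends, and then sketches the schematic form of the meta-learning bounds described above. The symbols used for the meta-level quantities (the meta-learner output U, the meta-training data Z_{1:N} from N tasks, the per-task outputs W_i, and the per-task sample size m) are illustrative notation only; the exact constants, sub-Gaussianity conditions, and conditioning are those stated in the paper.

% Single-task bound (Xu and Raginsky, 2017): for n i.i.d. samples
% S = (Z_1, ..., Z_n) drawn from mu and a sigma-sub-Gaussian loss ell(w, Z),
% the expected generalization gap of an algorithm P_{W|S} satisfies
\[
  \left| \mathbb{E}\!\left[ L_\mu(W) - L_S(W) \right] \right|
  \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(W; S)} .
\]
% Schematic form of the meta-learning bounds (hedged; exact statements in the paper):
% the separate within-task training/test case retains only the first term, while the
% joint case adds an averaged per-task MI term that captures within-task uncertainty.
\[
  \left| \mathbb{E}\!\left[ \Delta_{\mathrm{meta}} \right] \right|
  \;\lesssim\;
  \sqrt{\frac{I(U; Z_{1:N})}{N}}
  \;+\;
  \frac{1}{N} \sum_{i=1}^{N} \sqrt{\frac{I(W_i; Z_i)}{m}} .
\]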
