Incremental Learning-to-Learn with Statistical Guarantees

In learning-to-learn, the goal is to infer a learning algorithm that works well on a class of tasks sampled from an unknown meta distribution. In contrast to previous work on batch learning-to-learn, we consider a scenario where tasks are presented sequentially and the algorithm must adapt incrementally to improve its performance on future tasks. A key requirement in this setting is that the algorithm rapidly incorporates new observations into the model as they arrive, without storing them in memory. We focus on the case where the underlying algorithm is ridge regression parameterized by a positive semidefinite matrix. We propose to learn this matrix by applying a stochastic strategy to minimize the empirical error incurred by ridge regression on future tasks sampled from the meta distribution. We study the statistical properties of the proposed algorithm and prove non-asymptotic bounds on its excess transfer risk, that is, its generalization performance on new tasks drawn from the same meta distribution. We compare our online learning-to-learn approach with a state-of-the-art batch method, both theoretically and empirically.
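The abstract describes the algorithmic idea only at a high level; below is a minimal sketch of one way that idea could look in code. It assumes a linear regression setting in which each task is a dataset (X, y), ridge regression is parameterized by a positive semidefinite matrix D through the closed-form solution w = D Xᵀ(X D Xᵀ + λnI)⁻¹y, and D is updated by a projected stochastic gradient step on each newly arrived task's training error. The names (meta_ridge_sgd, project_psd, ridge_predictor), the constant step size, and the trace-bound projection are illustrative assumptions rather than the paper's implementation; in particular, the step-size schedule and iterate averaging used in the paper's analysis are omitted for brevity.

```python
import numpy as np

def project_psd(D, trace_bound=None):
    """Project a symmetric matrix onto the PSD cone; optionally rescale so the
    trace does not exceed a given bound (a simple proxy for the bounded
    constraint set assumed in the analysis)."""
    w, V = np.linalg.eigh((D + D.T) / 2)
    D = (V * np.clip(w, 0.0, None)) @ V.T
    if trace_bound is not None and np.trace(D) > trace_bound:
        D *= trace_bound / np.trace(D)
    return D

def meta_ridge_sgd(tasks, d, lam=1.0, step=0.1, trace_bound=None):
    """Sketch of incremental learning-to-learn: visit each task (X, y) once,
    in arrival order, and update the meta-parameter D by a projected
    stochastic gradient step on that task's ridge-regression training error."""
    D = np.eye(d)                      # start from the plain ridge metric
    for X, y in tasks:
        n = len(y)
        c = lam * n
        M = X @ D @ X.T + c * np.eye(n)
        a = np.linalg.solve(M, y)      # M^{-1} y
        b = np.linalg.solve(M, a)      # M^{-2} y
        # The task's training error equals (c^2 / n) * ||M^{-1} y||^2 when the
        # closed-form ridge solution is plugged in; this is its gradient in D.
        grad = -(c ** 2 / n) * (X.T @ (np.outer(a, b) + np.outer(b, a)) @ X)
        D = project_psd(D - step * grad, trace_bound)
    return D

def ridge_predictor(X, y, D, lam=1.0):
    """Within-task ridge regression with the learned metric D:
    w = D X^T (X D X^T + lam*n*I)^{-1} y."""
    n = len(y)
    M = X @ D @ X.T + lam * n * np.eye(n)
    return D @ X.T @ np.linalg.solve(M, y)
```

A matrix D learned this way can then be passed to ridge_predictor together with a new task's training sample to obtain that task's weight vector, which is how the transfer risk on future tasks would be assessed.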
