Multi-Environment Meta-Learning in Stochastic Linear Bandits

In this work we investigate meta-learning (or learning-to-learn) approaches for multi-task stochastic linear bandit problems whose tasks can originate from multiple environments. Inspired by the work of [1] on meta-learning in a sequence of linear bandit problems whose parameters are sampled from a single distribution (i.e., a single environment), we study the feasibility of meta-learning when the task parameters are instead drawn from a mixture distribution. For this problem, we propose a regularized version of the OFUL algorithm that, when trained on tasks with labeled environments, achieves low regret on a new task without requiring knowledge of the environment from which the new task originates. Specifically, our regret bound for the new algorithm captures the effect of environment misclassification and highlights the benefits over both learning each task separately and meta-learning that ignores the distinct mixture components.
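The sketch below illustrates the core mechanism behind such a regularized OFUL step: a ridge-regression estimate biased toward a meta-learned parameter (e.g., the estimated mean of the environment the task is assigned to), followed by optimistic action selection with an ellipsoidal confidence bonus. This is a minimal sketch, not the paper's exact algorithm; the names (`regularized_oful_action`, `theta_bias`, `lam`, `beta`) and the constant confidence radius are illustrative assumptions, whereas the paper derives the radius and the regularization from its regret analysis.

```python
import numpy as np

def regularized_oful_action(arms, X, y, theta_bias, lam=1.0, beta=1.0):
    """Pick one arm via OFUL with ridge regression biased toward theta_bias.

    arms:       (K, d) candidate action features
    X, y:       (t, d) features and (t,) rewards observed so far in this task
    theta_bias: (d,) regularization center, e.g. a meta-learned estimate of
                the mean parameter of the (possibly misclassified) environment
    lam:        regularization strength (how much we trust theta_bias)
    beta:       confidence-ellipsoid radius (a constant here; set by theory)
    """
    d = arms.shape[1]
    # Biased ridge estimate: argmin_theta ||X theta - y||^2 + lam ||theta - theta_bias||^2
    V = lam * np.eye(d) + X.T @ X
    theta_hat = np.linalg.solve(V, X.T @ y + lam * theta_bias)
    # Optimistic score: predicted reward plus exploration bonus ||a||_{V^{-1}}
    V_inv = np.linalg.inv(V)
    bonus = np.sqrt(np.einsum("kd,de,ke->k", arms, V_inv, arms))
    return int(np.argmax(arms @ theta_hat + beta * bonus))

# Toy run on a single new task; theta_bias plays the role of the meta-learned center.
rng = np.random.default_rng(0)
d, K = 5, 20
theta_true = rng.normal(size=d)
theta_bias = theta_true + 0.1 * rng.normal(size=d)  # pretend meta-learning was accurate
arms = rng.normal(size=(K, d))
X, y = np.empty((0, d)), np.empty(0)
for t in range(50):
    a = regularized_oful_action(arms, X, y, theta_bias, lam=2.0, beta=0.5)
    reward = arms[a] @ theta_true + 0.1 * rng.normal()
    X, y = np.vstack([X, arms[a]]), np.append(y, reward)
```

A large `lam` trusts the meta-learned center, which helps when the environment is identified correctly and hurts under misclassification, while a small `lam` approaches learning each task separately; this trade-off is exactly what the regret bound described above quantifies.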

[1] B. Kveton et al. Hierarchical Bayesian Bandits, 2021, AISTATS.

[2] B. Kveton et al. Thompson Sampling with a Mixture Prior, 2021, AISTATS.

[3] Csaba Szepesvári et al. Meta-Thompson Sampling, 2021, ICML.

[4] Wei Hu et al. Provable Benefits of Representation Learning in Linear Bandits, 2020, arXiv.

[5] Christos Thrampoulidis et al. Stage-wise Conservative Linear Bandits, 2020, NeurIPS.

[6] Csaba Szepesvári et al. Bandit Algorithms, 2020.

[7] Alessandro Lazaric et al. Meta-learning with Stochastic Linear Bandits, 2020, ICML.

[8] Christos Thrampoulidis et al. Linear Thompson Sampling Under Unknown Linear Constraints, 2020, ICASSP.

[9] Sham M. Kakade et al. Few-Shot Learning via Learning the Representation, Provably, 2020, ICLR.

[10] M. Alizadeh et al. Safe Linear Thompson Sampling With Side Information, 2019, IEEE Transactions on Signal Processing.

[11] Maria-Florina Balcan et al. Adaptive Gradient-Based Meta-Learning Methods, 2019, NeurIPS.

[12] Subhransu Maji et al. Meta-Learning With Differentiable Convex Optimization, 2019, CVPR.

[13] Massimiliano Pontil et al. Learning-to-Learn Stochastic Gradient Descent with Biased Regularization, 2019, ICML.

[14] Joaquin Vanschoren et al. Meta-Learning: A Survey, 2018, Automated Machine Learning.

[15] Yu Zhang et al. Transferable Contextual Bandit for Cross-Domain Recommendation, 2018, AAAI.

[16] Ürün Dogan et al. Multi-Task Learning for Contextual Bandits, 2017, NIPS.

[17] Elias Bareinboim et al. Transfer Learning in Multi-Armed Bandit: A Causal Approach, 2017, AAMAS.

[18] Alessandro Lazaric et al. Linear Thompson Sampling Revisited, 2016, AISTATS.

[19] Pierre Alquier et al. Regret Bounds for Lifelong Learning, 2016, AISTATS.

[20] Massimiliano Pontil et al. The Benefit of Multitask Representation Learning, 2015, J. Mach. Learn. Res.

[21] Daniele Calandriello et al. Sparse Multi-task Reinforcement Learning, 2014, Intelligenza Artificiale.

[22] Alessandro Lazaric et al. Sequential Transfer in Multi-armed Bandit with Finite Set of Models, 2013, NIPS.

[23] Benjamin Van Roy et al. Learning to Optimize via Posterior Sampling, 2013, Math. Oper. Res.

[24] Massimiliano Pontil et al. Excess Risk Bounds for Multitask Learning with Trace Norm Regularization, 2012, COLT.

[25] Shipra Agrawal et al. Thompson Sampling for Contextual Bandits with Linear Payoffs, 2012, ICML.

[26] Massimiliano Pontil et al. Sparse Coding for Multitask and Transfer Learning, 2012, ICML.

[27] Rémi Munos et al. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, 2012, ALT.

[28] Csaba Szepesvári et al. Improved Algorithms for Linear Stochastic Bandits, 2011, NIPS.

[29] Aurélien Garivier et al. Parametric Bandits: The Generalized Linear Case, 2010, NIPS.

[30] Claudio Gentile et al. Linear Algorithms for Online Multitask Classification, 2010, COLT.

[31] Wei Chu et al. A Contextual-Bandit Approach to Personalized News Article Recommendation, 2010, WWW.

[32] J. Tsitsiklis et al. Linearly Parameterized Bandits, 2008, Math. Oper. Res.

[33] Tong Zhang et al. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, 2005, J. Mach. Learn. Res.

[34] Peter Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[35] Jonathan Baxter. A Model of Inductive Bias Learning, 2000, J. Artif. Intell. Res.

[36] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933, Biometrika.

[37] Jan N. van Rijn et al. Metalearning: Applications to Automated Machine Learning and Data Mining, 2022, Cognitive Technologies.

[38] M. Ghavamzadeh et al. Parameter and Feature Selection in Stochastic Linear Bandits, 2021, arXiv.

[39] Massimiliano Pontil et al. Learning To Learn Around A Common Mean, 2018, NeurIPS.

[40] Marta Soare. Multi-task Linear Bandits, 2014.

[41] Thomas P. Hayes et al. Stochastic Linear Optimization under Bandit Feedback, 2008, COLT.