Sequential Resource Allocation in Linear Stochastic Bandits

This thesis is dedicated to the study of resource allocation problems in uncertain environments, where an agent sequentially selects which action to take. After each step, the environment returns a noisy observation of the value of the selected action. These observations guide the agent in adapting its resource allocation strategy towards reaching a given objective. In the most typical setting of this kind, the stochastic multi-armed bandit (MAB), each observation is assumed to be drawn from an unknown probability distribution associated with the selected action and gives no information on the expected value of the other actions. The MAB setting has been widely studied, and optimal allocation strategies have been proposed for various objectives under the MAB assumptions. Here, we consider a variant of the MAB setting in which the environment has a global linear structure, so that by selecting an action the agent also gathers information on the value of the other actions. The agent therefore needs to adapt its resource allocation strategy to exploit this structure. In particular, we study the design of sequences of actions that the agent should take to reach objectives such as: (i) identifying the action with the best value with a fixed confidence and a minimum number of pulls, or (ii) minimizing the prediction error on the value of each action. In addition, we investigate how the knowledge gathered by a bandit algorithm in a given environment can be transferred to improve performance in other, similar environments.
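To make the shared linear structure concrete, here is a minimal sketch (not the thesis's own algorithm) of a linear stochastic bandit: each action is a feature vector, its value is the inner product with an unknown parameter, and every noisy observation refines a least-squares estimate of that parameter, hence the predicted value of all actions. Names such as `theta_star`, `arms`, and `sigma` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_arms, sigma = 5, 20, 0.1          # dimension, number of actions, noise level
theta_star = rng.normal(size=d)        # unknown parameter of the environment
arms = rng.normal(size=(n_arms, d))    # fixed set of action feature vectors

# Regularized least-squares estimate built from past (action, observation) pairs.
A = np.eye(d)                          # ridge term keeps A invertible
b = np.zeros(d)
for _ in range(100):
    x = arms[rng.integers(n_arms)]                  # select an action
    y = x @ theta_star + sigma * rng.normal()       # noisy observation of its value
    A += np.outer(x, x)
    b += y * x

theta_hat = np.linalg.solve(A, b)
# Thanks to the linear structure, predictions are available for every action,
# including those never selected.
predicted_values = arms @ theta_hat
best_arm = int(np.argmax(predicted_values))
```

Contrast this with the classical MAB setting, where an observation of one action would update only that action's estimate; here a single observation improves the prediction for the whole action set, which is what the allocation strategies studied in the thesis exploit.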
