论文信息 - Sequential Resource Allocation in Linear Stochastic Bandits

Sequential Resource Allocation in Linear Stochastic Bandits

This thesis is dedicated to the study of resource allocation problems in uncertain environments, where an agent can sequentially select which action to take. After each step, the environment returns a noisy observation of the value of the selected action. These observations guide the agent in adapting his resource allocation strategy towards reaching a given objective. In the most typical setting of this kind, the stochastic multi-armed bandit (MAB), it is assumed that each observation is drawn from an unknown probability distribution associated with the selected action and gives no information on the expected value of the other actions. The MAB setting has been widely studied and optimal allocation strategies were proposed to solve various objectives under the MAB assumptions. Here, we consider a variant of the MAB setting where there exists a global linear structure in the environment and by selecting an action, the agent also gathers information on the value of the other actions. Therefore, the agent needs to adapt his resource allocation strategy to exploit the structure in the environment. In particular, we study the design of sequences of actions that the agent should take to reach objectives such as: (i) identifying the best value with a fixed confidence and using a minimum number of pulls, or (ii) minimizing the prediction error on the value of each action. In addition, we investigate how the knowledge gathered by a bandit algorithm in a given environment can be transferred to improve the performance in other similar environments.

Marta Soare | Marta Soare

[1] Alessandro Lazaric,et al. Transfer from Multiple MDPs , 2011, NIPS.

[2] Robert E. Bechhofer,et al. Sequential Identification and Ranking Procedures. , 1968 .

[3] T. L. Lai Andherbertrobbins. Asymptotically Efficient Adaptive Allocation Rules , 2022 .

[4] Sébastien Bubeck,et al. Multiple Identifications in Multi-Armed Bandits , 2012, ICML.

[5] W. J. Studden,et al. Theory Of Optimal Experiments , 1972 .

[6] Wei Chu,et al. Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.

[7] Andrea Bonarini,et al. Transfer of samples in batch reinforcement learning , 2008, ICML '08.

[8] Ilja Kuzborskij,et al. Learning by Transferring from Auxiliary Hypotheses , 2014, ArXiv.

[9] Matthew Malloy,et al. lil' UCB : An Optimal Exploration Algorithm for Multi-Armed Bandits , 2013, COLT.

[10] Csaba Szepesvári,et al. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits , 2009, Theor. Comput. Sci..

[11] Aurélien Garivier,et al. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models , 2014, J. Mach. Learn. Res..

[12] Shuai Li,et al. Online Clustering of Bandits , 2014, ICML.

[13] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[14] Jason Weston,et al. A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[15] Alessandro Lazaric,et al. Best-Arm Identification in Linear Bandits , 2014, NIPS.

[16] Varun Grover,et al. Active learning in heteroscedastic noise , 2010, Theor. Comput. Sci..

[17] Nando de Freitas,et al. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning , 2014, AISTATS.

[18] Selin Damla Ahipasaoglu,et al. Solving ellipsoidal inclusion and optimal experimental design problems: theory and algorithms , 2009 .

[19] Lihong Li,et al. Sample Complexity of Multi-task Reinforcement Learning , 2013, UAI.

[20] Friedrich Pukelsheim,et al. Optimal weights for experimental designs on linearly independent support points , 1991 .

[21] Rémi Munos,et al. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning , 2014, Found. Trends Mach. Learn..

[22] Stanley Osher,et al. A survey on level set methods for inverse problems and optimal design , 2005, European Journal of Applied Mathematics.

[23] P. Bickel,et al. Regularized estimation of large covariance matrices , 2008, 0803.1909.

[24] R. Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem , 1995, Advances in Applied Probability.

[25] F. Pukelsheim. Optimal Design of Experiments (Classics in Applied Mathematics) (Classics in Applied Mathematics, 50) , 2006 .

[26] Michael Jackson,et al. Optimal Design of Experiments , 1994 .

[27] Guillaume Sagnol,et al. Submodularity and Randomized rounding techniques for Optimal Experimental Design , 2010, Electron. Notes Discret. Math..

[28] Marta Soare. Active Learning in Linear Stochastic Bandits , 2013 .

[29] D. Titterington. Optimal design: Some geometrical aspects of D-optimality , 1975 .

[30] Alessandro Lazaric,et al. Sequential Transfer in Multi-armed Bandit with Finite Set of Models , 2013, NIPS.

[31] Koby Crammer,et al. Learning from Multiple Sources , 2006, NIPS.

[32] Antonio Torralba,et al. Transfer Learning by Borrowing Examples for Multiclass Object Detection , 2011, NIPS.

[33] Christoph H. Lampert,et al. Curriculum learning of multiple tasks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Qiang Yang,et al. A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[35] Alessandro Lazaric,et al. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence , 2012, NIPS.

[36] Varun Grover,et al. Active Learning in Multi-armed Bandits , 2008, ALT.

[37] R. Munos,et al. Best Arm Identification in Multi-Armed Bandits , 2010, COLT.

[38] Michèle Sebag,et al. Experimental Design in Dynamical System Identification: A Bandit-Based Active Learning Approach , 2014, ECML/PKDD.

[39] Andrew W. Moore,et al. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation , 1993, NIPS.

[40] Wei Chu,et al. A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[41] Oren Somekh,et al. Almost Optimal Exploration in Multi-Armed Bandits , 2013, ICML.

[42] J. Kiefer,et al. The Equivalence of Two Extremum Problems , 1960, Canadian Journal of Mathematics.

[43] Marta Soare. Multi-task Linear Bandits , 2014 .

[44] Yaming Yu. Monotonic convergence of a general algorithm for computing optimal designs , 2009, 0905.2646.

[45] Thomas P. Hayes,et al. Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.