Contextual Gaussian Process Bandit Optimization

How should we design experiments to maximize the performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round we receive context (about the experimental conditions, the query) and have to choose an action (parameters, documents). The key challenge is to trade off exploration, gathering data to estimate the mean payoff function over the context-action space, against exploitation, choosing the action deemed optimal based on the data gathered so far. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence-style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in automated vaccine design and sensor management, and show that context-sensitive optimization outperforms both ignoring the context and using it naively.
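
The paper itself contains no code; the following is a minimal sketch of the CGP-UCB selection rule under several simplifying assumptions: a finite, discretized action set, squared-exponential kernels on both contexts and actions combined by a product (one instance of the composite kernels the abstract alludes to), and a constant exploration weight `beta` (the paper's analysis grows this with the round index). The toy payoff function and all names (`rbf`, `composite_kernel`, `gp_posterior`) are illustrative, not from the paper.

```python
# Hedged sketch of CGP-UCB: at each round, observe a context z, then pick the
# action maximizing the GP upper confidence bound mu + sqrt(beta) * sigma on
# the joint (action, context) space. All specifics below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, lengthscale=0.3):
    """Squared-exponential kernel between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def composite_kernel(X1, X2):
    """Product kernel on the joint space:
    k((s, z), (s', z')) = k_action(s, s') * k_context(z, z')."""
    return rbf(X1[:, :1], X2[:, :1]) * rbf(X1[:, 1:], X2[:, 1:])

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and standard deviation of the GP at X_test."""
    K = composite_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = composite_kernel(X_test, X_train)
    mu = Ks @ np.linalg.solve(K, y_train)
    v = np.linalg.solve(K, Ks.T)
    var = 1.0 - np.einsum("ij,ji->i", Ks, v)  # prior variance k(x, x) = 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

actions = np.linspace(0, 1, 50)                   # discretized action set
payoff = lambda s, z: np.exp(-8 * (s - z) ** 2)   # toy payoff; optimum shifts with context
beta = 2.0                                        # exploration weight (constant here)

X_hist, y_hist = [], []
for t in range(30):
    z = rng.uniform()                             # context revealed by the environment
    X_cand = np.column_stack([actions, np.full_like(actions, z)])
    if X_hist:
        mu, sd = gp_posterior(np.array(X_hist), np.array(y_hist), X_cand)
    else:
        mu, sd = np.zeros(len(actions)), np.ones(len(actions))
    s = actions[np.argmax(mu + np.sqrt(beta) * sd)]   # upper-confidence rule
    X_hist.append([s, z])
    y_hist.append(payoff(s, z) + 0.01 * rng.standard_normal())
```

The product kernel above encodes that payoffs for similar actions under similar contexts are correlated; a sum kernel k_action + k_context is the other natural composite, and the choice between them is exactly the "mixing and matching" the abstract refers to.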
