Information Directed Sampling and Bandits with Heteroscedastic Noise

In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy function evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and to satisfy a tail bound that holds uniformly over the domain. In this work, we consider the setting of heteroscedastic noise, that is, we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs between information and regret, which are not captured by existing approaches such as upper confidence bound (UCB) algorithms or Thompson Sampling. To address these shortcomings, we introduce a frequentist regret framework similar to the Bayesian analysis of Russo and Van Roy (2014). We prove a new high-probability regret bound for general, possibly randomized policies that depends on a quantity we call the regret-information ratio. From this bound, we define a frequentist version of Information Directed Sampling (IDS) that minimizes a surrogate of the regret-information ratio over all possible action sampling distributions. To construct the surrogate function, we generalize known concentration inequalities for least squares regression in separable Hilbert spaces to the case of heteroscedastic noise. This allows us to formulate several variants of IDS for linear and reproducing kernel Hilbert space response functions, yielding a family of novel algorithms for Bayesian optimization. We also provide frequentist regret bounds, which in the homoscedastic case are comparable to existing bounds for UCB, but can be much better when the noise is heteroscedastic. Finally, we demonstrate empirically in a linear setting that some of our methods can outperform UCB and Thompson Sampling, even when the noise is homoscedastic.
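The regret-information trade-off described above can be made concrete with a small numerical sketch. The following is a minimal illustration, not the paper's exact algorithm: it assumes a finite action set, hypothetical gap estimates `gaps` (UCB-minus-LCB style bounds), and information gains `info` discounted by a point-dependent noise variance, and it uses the fact that the distribution minimizing the ratio (E[gap])^2 / E[info] is supported on at most two actions (Russo and Van Roy, 2014).

```python
import numpy as np

def ids_distribution(gaps, info, grid=1001):
    """Return an action distribution minimizing (E[gap])^2 / E[info].

    The minimizer is supported on at most two actions, so it suffices to
    search over action pairs and a grid of mixing weights.
    """
    gaps, info = np.asarray(gaps, dtype=float), np.asarray(info, dtype=float)
    n = len(gaps)
    p = np.linspace(0.0, 1.0, grid)          # candidate mixing weights
    best_ratio, best = np.inf, None
    for i in range(n):
        for j in range(n):
            # Mixture over the pair (i, j) with weight p on action i.
            mean_gap = p * gaps[i] + (1 - p) * gaps[j]
            mean_info = p * info[i] + (1 - p) * info[j]
            ratio = mean_gap ** 2 / np.maximum(mean_info, 1e-12)
            k = int(np.argmin(ratio))
            if ratio[k] < best_ratio:
                best_ratio, best = ratio[k], (i, j, p[k])
    i, j, w = best
    dist = np.zeros(n)
    dist[i] += w
    dist[j] += 1.0 - w
    return dist

# Hypothetical three-armed example with heteroscedastic noise: the information
# gain of an action is discounted by its point-dependent noise variance.
gaps = np.array([0.1, 0.3, 0.5])            # surrogate regret gaps (assumed known here)
posterior_var = np.array([0.2, 0.8, 1.0])   # model uncertainty at each action
noise_var = np.array([2.0, 0.1, 0.1])       # observation-noise variance at each action
info = 0.5 * np.log(1.0 + posterior_var / noise_var)
print(ids_distribution(gaps, info))         # mass concentrates on at most two actions
```

With these hypothetical numbers, the minimizing distribution randomizes between the low-gap action and a more informative, low-noise action rather than committing to a single arm, which is the kind of regret-information trade-off the abstract refers to.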

[1] Daniel Russo and Benjamin Van Roy. An Information-Theoretic Analysis of Thompson Sampling. J. Mach. Learn. Res., 2014.

[2] Daniel Russo and Benjamin Van Roy. Learning to Optimize via Information-Directed Sampling. NIPS, 2014.

[3] A. C. Aitken. IV.—On Least Squares and Linear Combination of Observations. 1936.

[4] Ambuj Tewari et al. On the Generalization Ability of Online Strongly Convex Programming Algorithms. NIPS, 2008.

[5] Csaba Szepesvári et al. Online Learning for Linearly Parametrized Control Problems. 2012.

[6] Thomas M. Cover et al. Elements of Information Theory. 2005.

[7] Koby Crammer et al. Linear Multi-Resource Allocation with Semi-Bandit Feedback. NIPS, 2015.

[8] Yu. V. Prokhorov. Convergence of Random Processes and Limit Theorems in Probability Theory. 1956.

[9] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. 1933.

[10] Csaba Szepesvári et al. Improved Algorithms for Linear Stochastic Bandits. NIPS, 2011.

[11] Thomas P. Hayes et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2008.

[12] Andreas Krause et al. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting. IEEE Transactions on Information Theory, 2009.

[13] Rajendra Bhatia et al. A Better Bound on the Variance. Am. Math. Mon., 2000.

[14] A. Burnetas et al. Optimal Adaptive Policies for Sequential Allocation Problems. 1996.

[15] Aditya Gopalan et al. On Kernelized Multi-armed Bandits. ICML, 2017.

[16] Michael N. Katehakis et al. Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem. arXiv, 2015.

[17] Annie Marsden et al. Sequential Matrix Completion. arXiv, 2017.

[18] Peter Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.

[19] Varun Grover et al. Active Learning in Heteroscedastic Noise. Theor. Comput. Sci., 2010.

[20] Tor Lattimore et al. A Scale Free Algorithm for Stochastic Bandits with Bounded Kurtosis. NIPS, 2017.

[21] Sébastien Bubeck et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Found. Trends Mach. Learn., 2012.

[22] Stephen P. Boyd et al. Convex Optimization. 2004.

[23] D. Freedman. On Tail Probabilities for Martingales. 1975.

[24] Nando de Freitas et al. Heteroscedastic Treed Bayesian Optimisation. arXiv, 2014.

[25] Tor Lattimore et al. The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits. AISTATS, 2016.

[26] Xiequan Fan et al. Exponential Inequalities for Martingales with Applications. arXiv:1311.6273, 2013.

[27] Wolfram Burgard et al. Most Likely Heteroscedastic Gaussian Process Regression. ICML, 2007.

[28] Paul W. Goldberg et al. Regression with Input-dependent Noise: A Gaussian Process Treatment. NIPS, 1997.

[29] Nagarajan Natarajan et al. Active Heteroscedastic Regression. ICML, 2017.

[30] Zi Wang et al. Max-value Entropy Search for Efficient Bayesian Optimization. ICML, 2017.

[31] Shipra Agrawal et al. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML, 2012.

[32] Alessandro Lazaric et al. Linear Thompson Sampling Revisited. AISTATS, 2016.