Nonparametric Gaussian mixture models for the multi-armed contextual bandit

We adopt Bayesian nonparametric mixture models to extend multi-armed bandits in general, and Thompson sampling in particular, to complex scenarios with reward model uncertainty. The multi-armed bandit is a sequential allocation task in which an agent must learn a policy that maximizes long-term payoff, observing only the reward of the played arm at each interaction with the world. In the stochastic bandit setting, the reward for the selected action is generated at each interaction from an unknown distribution. Thompson sampling is a generative, interpretable multi-armed bandit algorithm that has been shown both to perform well in practice and to enjoy optimality properties for certain reward functions. Nevertheless, Thompson sampling requires knowledge of the true reward model, both to compute expected rewards and to sample from the model's parameter posterior. In this work, we extend Thompson sampling to scenarios with model uncertainty by adopting a very flexible family of reward distributions: nonparametric Gaussian mixture models. The generative process of Bayesian nonparametric mixtures naturally aligns with the Bayesian modeling of multi-armed bandits: the nonparametric model autonomously adjusts its complexity in an online fashion as new rewards are observed for the played arms. By characterizing each arm's reward distribution with independent Dirichlet process mixtures and per-mixture parameters, the proposed method sequentially learns the model that best approximates the true underlying reward distribution, achieving strong performance on synthetic and real datasets. Our contribution is valuable for practical scenarios, as it avoids stringent case-by-case model specifications, and yet attains reduced regret in diverse bandit settings.
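To make the construction concrete, the following Python sketch illustrates one way a per-arm Dirichlet process Gaussian mixture can be paired with Thompson sampling: at every round, each arm draws a mixture model from an approximate posterior over its observed rewards, the arm with the highest sampled expected reward is played, and the new reward updates that arm's mixture. The class name, hyperparameters (alpha, mu0, tau2, sigma2), the single collapsed Gibbs sweep per round, and the toy bimodal environment are illustrative assumptions for this sketch, not the paper's implementation.

```python
# A minimal sketch (not the authors' implementation) of Thompson sampling where each
# arm's rewards follow a Dirichlet process Gaussian mixture. Assumed choices: known
# observation variance sigma2, Normal(mu0, tau2) base measure over component means,
# DP concentration alpha, and one collapsed Gibbs sweep as the approximate posterior update.
import numpy as np

rng = np.random.default_rng(0)

class DPMixtureArm:
    def __init__(self, alpha=1.0, mu0=0.0, tau2=4.0, sigma2=0.25):
        self.alpha, self.mu0, self.tau2, self.sigma2 = alpha, mu0, tau2, sigma2
        self.rewards = []  # rewards observed when this arm was played
        self.z = []        # mixture-component assignment of each reward

    def _posterior(self, members):
        # Conjugate Normal posterior over a component mean given its members.
        prec = 1.0 / self.tau2 + len(members) / self.sigma2
        mean = (self.mu0 / self.tau2 + sum(members) / self.sigma2) / prec
        return mean, 1.0 / prec

    def _gibbs_sweep(self):
        # One collapsed Gibbs sweep over component assignments (CRP prior,
        # conjugate Normal-Normal likelihood with known observation variance).
        for i, r in enumerate(self.rewards):
            self.z[i] = -1
            m = max(self.z)
            occupied = np.array([c for c in self.z if c >= 0], dtype=int)
            counts = np.bincount(occupied, minlength=m + 1) if m >= 0 else np.zeros(0, dtype=int)
            logp = []
            for k, n_k in enumerate(counts):
                if n_k == 0:
                    logp.append(-np.inf)
                    continue
                members = [self.rewards[j] for j in range(len(self.z)) if self.z[j] == k]
                mu_k, v_k = self._posterior(members)
                var = v_k + self.sigma2  # posterior predictive variance for component k
                logp.append(np.log(n_k) - 0.5 * (r - mu_k) ** 2 / var - 0.5 * np.log(var))
            # Probability of opening a new component (prior predictive).
            var_new = self.tau2 + self.sigma2
            logp.append(np.log(self.alpha) - 0.5 * (r - self.mu0) ** 2 / var_new - 0.5 * np.log(var_new))
            p = np.exp(np.array(logp) - max(logp))
            self.z[i] = int(rng.choice(len(p), p=p / p.sum()))

    def sample_expected_reward(self):
        # Thompson step: draw one mixture model from the (approximate) posterior
        # and return its expected reward.
        if not self.rewards:
            return rng.normal(self.mu0, np.sqrt(self.tau2))  # draw from the prior
        self._gibbs_sweep()
        ks = sorted(set(self.z))
        weights = rng.dirichlet([self.z.count(k) for k in ks])  # occupied components only
        means = []
        for k in ks:
            mu_k, v_k = self._posterior([self.rewards[j] for j in range(len(self.z)) if self.z[j] == k])
            means.append(rng.normal(mu_k, np.sqrt(v_k)))
        return float(np.dot(weights, means))

    def update(self, reward):
        self.rewards.append(reward)
        self.z.append(0)  # provisional assignment; the next Gibbs sweep revises it

# Toy bandit loop with two arms whose rewards are bimodal (hypothetical environment).
arms = [DPMixtureArm() for _ in range(2)]
draw = [lambda: rng.normal((-1.0, 2.0)[rng.integers(2)], 0.5),
        lambda: rng.normal((0.0, 0.5)[rng.integers(2)], 0.5)]
for t in range(200):
    a = int(np.argmax([arm.sample_expected_reward() for arm in arms]))
    arms[a].update(draw[a]())
```

The sweep and the Dirichlet draw over occupied components are deliberately simple stand-ins for the full nonparametric posterior inference; the point of the sketch is only the overall loop of sampling a mixture per arm and acting greedily with respect to the sampled expected rewards.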
