Graphical Models Meet Bandits: A Variational Thompson Sampling Approach

We propose a novel framework for structured bandits, which we call an influence diagram bandit. Our framework uses a graphical model to capture complex statistical dependencies between actions, latent variables, and observations, and thus unifies and extends many existing models, such as combinatorial semi-bandits, cascading bandits, and low-rank bandits. We develop novel online learning algorithms that learn to act efficiently in our models. The key idea is to track a structured posterior distribution over model parameters, either exactly or approximately. To act, we sample model parameters from their posterior and then use the structure of the influence diagram to find the best action under the sampled parameters. We empirically evaluate our algorithms on three structured bandit problems and show that they perform as well as or better than problem-specific state-of-the-art baselines.
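The sample-then-optimize loop described above follows the generic Thompson sampling template. Below is a minimal illustrative sketch of that loop for a plain Beta-Bernoulli bandit; the structured, influence-diagram-specific posterior and the action search that exploits the diagram's structure are not reproduced here, and all variable names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

# Illustrative Thompson sampling loop on a Beta-Bernoulli bandit
# (a stand-in for the structured posterior used in the paper).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])  # unknown to the learner
n_arms = len(true_means)

# Beta(1, 1) prior over each arm's mean reward.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(1000):
    # 1. Sample model parameters from the current posterior.
    sampled_means = rng.beta(alpha, beta)
    # 2. Act greedily under the sampled parameters
    #    (in the paper, this step uses the influence-diagram structure).
    arm = int(np.argmax(sampled_means))
    # 3. Observe a reward and update the posterior of the chosen arm.
    reward = rng.binomial(1, true_means[arm])
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```

In the influence diagram bandit, step 2 replaces the simple argmax with an optimization over structured actions, and steps 1 and 3 maintain a structured (exact or variational) posterior rather than independent Beta distributions.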
