Parallelizing Contextual Linear Bandits

Standard approaches to decision-making under uncertainty explore the space of decisions sequentially. However, simultaneously proposing a batch of decisions, which exploits available resources for parallel experimentation, can substantially accelerate exploration. We present a family of parallel contextual linear bandit algorithms whose regret is nearly identical to that of their perfectly sequential counterparts (given access to the same total number of oracle queries), up to a lower-order "burn-in" term that depends on the geometry of the context set. We complement these upper bounds with matching information-theoretic lower bounds on parallel regret, establishing that our algorithms are asymptotically optimal in the time horizon. Finally, we present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems, to demonstrate the utility of parallelized bandits in practical settings.
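To make the batched setting concrete, below is a minimal sketch of batched linear Thompson sampling, one plausible instantiation of a parallel contextual linear bandit. It is not the paper's exact algorithm: the function names, the toy interfaces `contexts_fn` and `reward_fn`, and all parameter defaults are assumptions for illustration. The key structural point it shows is that every decision in a batch is drawn from the same (stale) model statistics, which are refreshed only once per batch.

```python
import numpy as np

def batched_linear_ts(contexts_fn, reward_fn, d, T, P,
                      lam=1.0, noise_scale=1.0, seed=0):
    """Sketch of batched linear Thompson sampling (illustrative only).

    contexts_fn(t) -> (K, d) array of candidate action features at round t
    reward_fn(x)   -> noisy scalar reward for a chosen feature vector x
    d: feature dimension, T: number of batches, P: batch (parallel) size
    """
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)              # running sum of reward-weighted features
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b    # ridge-regression estimate of the parameter
        X = contexts_fn(t)
        chosen = []
        # All P decisions in the batch reuse the same stale statistics:
        for _ in range(P):
            theta_tilde = rng.multivariate_normal(
                theta_hat, noise_scale**2 * V_inv)
            chosen.append(X[np.argmax(X @ theta_tilde)])
        # Rewards arrive together; the model is updated once per batch.
        for x in chosen:
            r = reward_fn(x)
            V += np.outer(x, x)
            b += r * x
    return np.linalg.inv(V) @ b  # final parameter estimate
```

With P = 1 this reduces to ordinary sequential linear Thompson sampling; larger P trades per-decision freshness of the statistics for parallel throughput, which is precisely the trade-off that the lower-order burn-in term in the regret bound quantifies.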
