Semiparametric Contextual Bandits

This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by an non-linear action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches and our proofs require new concentration inequalities for self-normalized martingales.

[1]  P. Bickel Efficient and Adaptive Estimation for Semiparametric Models , 1993 .

[2]  Benjamin Van Roy,et al.  Learning to Optimize via Posterior Sampling , 2013, Math. Oper. Res..

[3]  Karthik Sridharan,et al.  BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits , 2016, ICML.

[4]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[5]  John Langford,et al.  Contextual Bandit Algorithms with Supervised Learning Guarantees , 2010, AISTATS.

[6]  Aurélien Garivier,et al.  Parametric Bandits: The Generalized Linear Case , 2010, NIPS.

[7]  J. Robins,et al.  Double/Debiased Machine Learning for Treatment and Causal Parameters , 2016, 1608.00060.

[8]  Lihong Li,et al.  Provable Optimal Algorithms for Generalized Linear Contextual Bandits , 2017, ArXiv.

[9]  M. Sion On general minimax theorems , 1958 .

[10]  P. Robinson ROOT-N-CONSISTENT SEMIPARAMETRIC REGRESSION , 1988 .

[11]  T. Lai,et al.  Theory and applications of multivariate self-normalized processes , 2009 .

[12]  John Langford,et al.  Off-policy evaluation for slate recommendation , 2016, NIPS.

[13]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[14]  J. Robins,et al.  Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers , 1992 .

[15]  T. Lai,et al.  Self-Normalized Processes: Limit Theory and Statistical Applications , 2001 .

[16]  Nicolò Cesa-Bianchi,et al.  Combinatorial Bandits , 2012, COLT.

[17]  John Langford,et al.  The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information , 2007, NIPS.

[18]  D. Freedman On Tail Probabilities for Martingales , 1975 .

[19]  A. Tsiatis Semiparametric Theory and Missing Data , 2006 .

[20]  John N. Tsitsiklis,et al.  Linearly Parameterized Bandits , 2008, Math. Oper. Res..

[21]  Wei Chu,et al.  Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.

[22]  Akshay Krishnamurthy,et al.  Efficient Algorithms for Adversarial Contextual Learning , 2016, ICML.

[23]  John Langford,et al.  Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits , 2014, ICML.

[24]  W. Newey,et al.  Double machine learning for treatment and causal parameters , 2016 .

[25]  Shipra Agrawal,et al.  Thompson Sampling for Contextual Bandits with Linear Payoffs , 2012, ICML.

[26]  Ambuj Tewari,et al.  From Ads to Interventions: Contextual Bandits in Mobile Health , 2017, Mobile Health - Sensors, Analytic Methods, and Applications.

[27]  Shie Mannor,et al.  Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems , 2006, J. Mach. Learn. Res..

[28]  Akshay Krishnamurthy,et al.  Contextual semibandits via supervised learning oracles , 2015, NIPS.

[29]  Kristjan H. Greenewald,et al.  Action Centered Contextual Bandits , 2017, NIPS.

[30]  Csaba Szepesvári,et al.  Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[31]  Thomas P. Hayes,et al.  Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.

[32]  Sham M. Kakade,et al.  Towards Minimax Policies for Online Linear Optimization with Bandit Feedback , 2012, COLT.

[33]  Aad van der Vaart,et al.  Higher order influence functions and minimax estimation of nonlinear functionals , 2008, 0805.3040.

[34]  Thomas M. Cover,et al.  Behavior of sequential predictors of binary sequences , 1965 .

[35]  Elad Hazan,et al.  Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization , 2008, COLT.