[1] P. Bickel. Efficient and Adaptive Estimation for Semiparametric Models, 1993.
[2] Benjamin Van Roy, et al. Learning to Optimize via Posterior Sampling, 2013, Math. Oper. Res.
[3] Karthik Sridharan, et al. BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits, 2016, ICML.
[4] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.
[5] John Langford, et al. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2010, AISTATS.
[6] Aurélien Garivier, et al. Parametric Bandits: The Generalized Linear Case, 2010, NIPS.
[7] J. Robins, et al. Double/Debiased Machine Learning for Treatment and Causal Parameters, 2016, arXiv:1608.00060.
[8] Lihong Li, et al. Provable Optimal Algorithms for Generalized Linear Contextual Bandits, 2017, arXiv.
[9] M. Sion. On General Minimax Theorems, 1958.
[10] P. Robinson. Root-N-Consistent Semiparametric Regression, 1988.
[11] T. Lai, et al. Theory and Applications of Multivariate Self-Normalized Processes, 2009.
[12] John Langford, et al. Off-Policy Evaluation for Slate Recommendation, 2016, NIPS.
[13] Wei Chu, et al. A Contextual-Bandit Approach to Personalized News Article Recommendation, 2010, WWW '10.
[14] J. Robins, et al. Recovery of Information and Adjustment for Dependent Censoring Using Surrogate Markers, 1992.
[15] T. Lai, et al. Self-Normalized Processes: Limit Theory and Statistical Applications, 2001.
[16] Nicolò Cesa-Bianchi, et al. Combinatorial Bandits, 2012, COLT.
[17] John Langford, et al. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information, 2007, NIPS.
[18] D. Freedman. On Tail Probabilities for Martingales, 1975.
[19] A. Tsiatis. Semiparametric Theory and Missing Data, 2006.
[20] John N. Tsitsiklis, et al. Linearly Parameterized Bandits, 2008, Math. Oper. Res.
[21] Wei Chu, et al. Contextual Bandits with Linear Payoff Functions, 2011, AISTATS.
[22] Akshay Krishnamurthy, et al. Efficient Algorithms for Adversarial Contextual Learning, 2016, ICML.
[23] John Langford, et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, 2014, ICML.
[24] W. Newey, et al. Double Machine Learning for Treatment and Causal Parameters, 2016.
[25] Shipra Agrawal, et al. Thompson Sampling for Contextual Bandits with Linear Payoffs, 2012, ICML.
[26] Ambuj Tewari, et al. From Ads to Interventions: Contextual Bandits in Mobile Health, 2017, Mobile Health - Sensors, Analytic Methods, and Applications.
[27] Shie Mannor, et al. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, 2006, J. Mach. Learn. Res.
[28] Akshay Krishnamurthy, et al. Contextual Semibandits via Supervised Learning Oracles, 2015, NIPS.
[29] Kristjan H. Greenewald, et al. Action Centered Contextual Bandits, 2017, NIPS.
[30] Csaba Szepesvári, et al. Improved Algorithms for Linear Stochastic Bandits, 2011, NIPS.
[31] Thomas P. Hayes, et al. Stochastic Linear Optimization under Bandit Feedback, 2008, COLT.
[32] Sham M. Kakade, et al. Towards Minimax Policies for Online Linear Optimization with Bandit Feedback, 2012, COLT.
[33] Aad van der Vaart, et al. Higher Order Influence Functions and Minimax Estimation of Nonlinear Functionals, 2008, arXiv:0805.3040.
[34] Thomas M. Cover, et al. Behavior of Sequential Predictors of Binary Sequences, 1965.
[35] Elad Hazan, et al. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization, 2008, COLT.