Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization

Offline policy learning (OPL) leverages existing data collected a priori for policy optimization without any active exploration. Despite the prevalence of and recent interest in this problem, its theoretical and algorithmic foundations in function approximation settings remain under-developed. In this paper, we study this problem along the axes of distributional shift, optimization, and generalization in offline contextual bandits with neural networks. In particular, we propose a provably efficient offline contextual bandit with neural network function approximation that does not require any functional assumption on the reward. We show that our method provably generalizes over unseen contexts under a milder condition on distributional shift than existing OPL works. Notably, unlike other OPL methods, our method learns from the offline data in an online manner using stochastic gradient descent, allowing us to bring the benefits of online learning to the offline setting. Moreover, we show that our method is more computationally efficient and has a better dependence on the effective dimension of the neural network than its online counterpart. Finally, we demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
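To make the abstract's ingredients concrete, the following is a minimal, illustrative sketch of a pessimism-based offline neural contextual bandit: a reward network is updated by streaming the logged data once with stochastic gradient descent, and actions are scored by a lower confidence bound built from the network's gradient features. This is only a sketch under assumed design choices; the class name `OfflineNeuralLCB`, the two-layer architecture, and all hyperparameters are hypothetical and are not the paper's exact algorithm or constants.

```python
# Illustrative pessimistic offline neural contextual bandit (hypothetical sketch).
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Two-layer ReLU network f(x) over a joint context-action feature vector."""
    def __init__(self, dim, width=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)


class OfflineNeuralLCB:
    """Hypothetical sketch: SGD over logged data + lower-confidence-bound action selection."""
    def __init__(self, dim, beta=1.0, lam=1.0, lr=1e-2):
        self.model = RewardNet(dim)
        self.beta = beta                                     # pessimism strength
        self.opt = torch.optim.SGD(self.model.parameters(), lr=lr)
        num_params = sum(p.numel() for p in self.model.parameters())
        self.Lambda = lam * torch.eye(num_params)            # regularized gradient-feature covariance

    def _grad_feature(self, x):
        # Gradient of f(x) w.r.t. the parameters, flattened: an empirical
        # neural tangent feature used here to quantify uncertainty.
        self.model.zero_grad()
        self.model(x.unsqueeze(0)).sum().backward()
        return torch.cat([p.grad.reshape(-1) for p in self.model.parameters()])

    def update(self, x, reward):
        """One SGD step on a single logged (context-action feature, reward) pair."""
        g = self._grad_feature(x).detach()
        self.Lambda += torch.outer(g, g)                      # accumulate uncertainty information
        self.opt.zero_grad()
        loss = 0.5 * ((self.model(x.unsqueeze(0)) - reward) ** 2).sum()
        loss.backward()
        self.opt.step()

    def lcb(self, x):
        """Pessimistic value: predicted reward minus an uncertainty bonus."""
        g = self._grad_feature(x).detach()
        width = torch.sqrt(g @ torch.linalg.solve(self.Lambda, g))
        return self.model(x.unsqueeze(0)).item() - self.beta * width.item()

    def act(self, candidate_feats):
        """Pick the action whose lower confidence bound is largest."""
        scores = [self.lcb(x) for x in candidate_feats]
        return int(torch.tensor(scores).argmax())
```

In this sketch, the logged (context-action feature, reward) pairs would be passed through `update` once in arrival order, mirroring the online-style single-pass training described in the abstract, and at deployment `act` selects the action with the highest pessimistic estimate among the candidate feature vectors.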
