Bandit Overfitting in Offline Policy Learning

We study offline policy learning in a contextual bandit framework. Specifically, we focus on the issue of overfitting, which is especially important in the modern setting where we often use overparameterized models that can interpolate the data. Our first contribution is a regret decomposition into approximation, estimation, and bandit errors that emphasizes the distinction between the policy learning problem and supervised learning. The bandit error measures the error incurred by overfitting to the single action observed at each context, which we call "bandit overfitting". Our second contribution is to show, both in theory and in experiments, how bandit overfitting differs between policy-based and value-based algorithms when we use overparameterized models. We find that bandit overfitting can become a severe problem for policy-based algorithms, whereas value-based algorithms effectively reduce the policy learning problem to regression and thus avoid the worst problems of bandit overfitting.
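To make the decomposition concrete, here is a schematic rendering in notation assumed for illustration (it is not reproduced verbatim from the paper): let $\hat\pi$ be the policy learned from logged bandit data, $\pi^*$ the optimal policy, $\pi_{\mathcal{F}}^*$ the best policy in the model class $\mathcal{F}$, and $\tilde\pi$ the policy the same algorithm would learn from full-feedback data, where the reward of every action is observed. Writing $V(\pi)$ for the value of a policy, the regret telescopes as

$$
V(\pi^*) - V(\hat\pi)
= \underbrace{V(\pi^*) - V(\pi_{\mathcal{F}}^*)}_{\text{approximation}}
\;+\; \underbrace{V(\pi_{\mathcal{F}}^*) - V(\tilde\pi)}_{\text{estimation}}
\;+\; \underbrace{V(\tilde\pi) - V(\hat\pi)}_{\text{bandit error}}.
$$

The first two terms have direct analogues in supervised learning; only the third term is specific to bandit feedback, which is what makes it the natural home for "bandit overfitting".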
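The policy-based versus value-based contrast can also be seen directly in the training objectives. The following is a minimal PyTorch-style sketch with hypothetical names (not the paper's implementation): given logged contexts `x`, logged actions `a`, observed rewards `r`, and logging propensities `p`, a value-based method fits a regression to the observed rewards and acts greedily, while a policy-based method maximizes an inverse-propensity-weighted (IPW) estimate of its own value.

```python
# Minimal sketch (hypothetical names, assumed setup) contrasting the two
# algorithm families on logged bandit data. Tensors: x (batch, d) contexts,
# a (batch,) logged action indices, r (batch,) observed rewards,
# p (batch,) logging propensities pi_0(a | x).
import torch
import torch.nn.functional as F

def value_based_loss(q_net, x, a, r):
    """Value-based: regress the reward of the logged action only.
    The learned policy is then greedy: pi(x) = argmax_a q_net(x)[a]."""
    q = q_net(x)                                   # (batch, n_actions)
    q_a = q.gather(1, a.unsqueeze(1)).squeeze(1)   # Q-value of logged action
    return F.mse_loss(q_a, r)                      # plain regression loss

def policy_based_loss(policy_net, x, a, r, p):
    """Policy-based: maximize the IPW estimate of the policy's value.
    With an interpolating model, this objective can be driven up by
    concentrating all probability on the single observed action at each
    context -- the 'bandit overfitting' failure mode."""
    logits = policy_net(x)                         # (batch, n_actions)
    log_pi = F.log_softmax(logits, dim=1)
    pi_a = log_pi.gather(1, a.unsqueeze(1)).squeeze(1).exp()  # pi(a | x)
    ipw_value = (r / p) * pi_a                     # per-example value estimate
    return -ipw_value.mean()                       # minimize negative value
```

The value-based loss is an ordinary regression at each logged (context, action) pair, so overparameterization behaves as it does in supervised learning. The policy-based loss directly rewards putting probability mass on logged actions with large `r / p`, and an overparameterized network can do this exactly at every training context, which is the failure mode the abstract describes.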
