Bandit Overfitting in Offline Policy Learning.
[1] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[2] Peter Szolovits, et al. Deep Reinforcement Learning for Sepsis Treatment, 2017, ArXiv.
[3] Csaba Szepesvári, et al. Finite-Time Bounds for Fitted Value Iteration, 2008, J. Mach. Learn. Res.
[4] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Stefan Wager, et al. Efficient Policy Learning, 2017, ArXiv.
[6] Francis R. Bach, et al. Breaking the Curse of Dimensionality with Convex Neural Networks, 2014, J. Mach. Learn. Res.
[7] Zhengyuan Zhou, et al. Offline Multi-Action Policy Learning: Generalization and Optimization, 2018, Oper. Res.
[8] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.
[9] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.
[10] Romain Laroche, et al. Safe Policy Improvement with Baseline Bootstrapping, 2017, ICML.
[11] Philip M. Long, et al. Benign overfitting in linear regression, 2019, Proceedings of the National Academy of Sciences.
[12] Mikhail Belkin, et al. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate, 2018, NeurIPS.
[13] Nathan Kallus, et al. Balanced Policy Evaluation and Learning, 2017, NeurIPS.
[14] Lihong Li, et al. Learning from Logged Implicit Exploration Data, 2010, NIPS.
[15] Thomas M. Cover, et al. Estimation by the nearest neighbor rule, 1968, IEEE Trans. Inf. Theory.
[16] Barnabás Póczos, et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.
[17] Doina Precup, et al. Off-Policy Deep Reinforcement Learning without Exploration, 2018, ICML.
[18] Mikhail Belkin, et al. Does data interpolation contradict statistical optimality?, 2018, AISTATS.
[19] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.
[20] Minmin Chen, et al. Surrogate Objectives for Batch Policy Optimization in One-step Decision Making, 2019, NeurIPS.
[21] Nan Jiang, et al. Information-Theoretic Considerations in Batch Reinforcement Learning, 2019, ICML.
[22] Wei Chu, et al. A contextual-bandit approach to personalized news article recommendation, 2010, WWW '10.
[23] Abhinav Gupta, et al. Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours, 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).
[24] Joaquin Quiñonero Candela, et al. Counterfactual reasoning and learning systems: the example of computational advertising, 2013, J. Mach. Learn. Res.
[25] Thorsten Joachims, et al. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback, 2015, ICML.
[26] Léon Bottou, et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.
[27] M. de Rijke, et al. Deep Learning with Logged Bandit Feedback, 2018, ICLR.
[28] Thorsten Joachims, et al. The Self-Normalized Estimator for Counterfactual Learning, 2015, NIPS.
[29] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[30] Vladimir Vapnik, et al. Estimation of Dependences Based on Empirical Data, Springer Series in Statistics, 1982.
[31] Peter E. Hart, et al. Nearest neighbor pattern classification, 1967, IEEE Trans. Inf. Theory.
[32] John Langford, et al. The offset tree for learning with partial labels, 2008, KDD.
[33] John Langford, et al. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information, 2007, NIPS.
[34] V. Vapnik. Estimation of Dependences Based on Empirical Data, 2006.
[35] Barbara E. Engelhardt, et al. A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in Intensive Care Units, 2017, UAI.