Paper information - High Confidence Policy Improvement

High Confidence Policy Improvement

We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy it proposes and has no hyper-parameters that require expert tuning. The user may select any performance lower bound, ρ₋, and confidence level, δ, and our algorithm will ensure that the probability that it returns a policy with performance below ρ₋ is at most δ. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a simple gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real-world data.
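
The guarantee described in the abstract can be illustrated with a minimal sketch of a safety test: a candidate policy is returned only if a (1 - δ)-confidence lower bound on its performance, estimated from batch data, is at least the user's performance lower bound. The abstract does not say which estimator or bound is used, so everything below is an assumption made for illustration: per-trajectory importance sampling as the off-policy estimator, Hoeffding's inequality as the bound, and the names `importance_weighted_return`, `hoeffding_lower_bound`, `safety_test`, and `upper` are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

# A trajectory is a sequence of (state, action, reward) tuples.
Trajectory = Sequence[Tuple[int, int, float]]
# A policy maps (state, action) to the probability of taking that action.
Policy = Callable[[int, int], float]


def importance_weighted_return(traj: Trajectory, pi_e: Policy, pi_b: Policy) -> float:
    """Estimate the candidate policy's return from one trajectory that was
    generated by the behavior policy (ordinary importance sampling).
    Assumes pi_b(s, a) > 0 for every (s, a) in the trajectory."""
    weight = 1.0
    total_reward = 0.0
    for state, action, reward in traj:
        weight *= pi_e(state, action) / pi_b(state, action)
        total_reward += reward
    return weight * total_reward


def hoeffding_lower_bound(samples: Sequence[float], upper: float, delta: float) -> float:
    """One-sided (1 - delta)-confidence lower bound on the mean of i.i.d.
    samples that lie in [0, upper] (assumption: the caller knows `upper`)."""
    n = len(samples)
    mean = sum(samples) / n
    return mean - upper * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def safety_test(samples: Sequence[float], upper: float,
                rho_minus: float, delta: float) -> bool:
    """Accept the candidate policy only if the lower bound on its
    performance is at least rho_minus."""
    return hoeffding_lower_bound(samples, upper, delta) >= rho_minus
```

In this sketch, `samples` would hold one importance-weighted return per logged trajectory, and a candidate that fails `safety_test` is simply not returned. Hoeffding's inequality tends to be loose when importance weights have a large range, so a practical implementation would likely need a tighter bound than this stand-in.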
