Policy Certificates: Towards Accountable Reinforcement Learning

The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.
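
The interaction protocol implied by the abstract can be sketched in a few lines: before each episode the algorithm announces a certificate bounding the return and sub-optimality of its upcoming policy, and an overseer (human or fallback policy) intervenes when that certificate is unsatisfactory. The Python sketch below is purely illustrative; the names `CertifiedAgent`, `Certificate`, `run_with_oversight`, and the threshold `min_certified_return` are assumptions for exposition, not the paper's actual algorithms or API.

```python
# Minimal sketch (assumed interface, not the paper's implementation) of the
# policy-certificate interaction loop: the agent certifies the quality of its
# next-episode policy before executing it, so an overseer can intervene.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Policy = Callable[[object], int]  # maps a state to an action


@dataclass
class Certificate:
    """Bounds announced by the agent before the next episode is executed."""
    return_lower_bound: float   # high-probability lower bound on the next policy's return
    suboptimality_bound: float  # high-probability upper bound on its gap to optimal


class CertifiedAgent:
    """Stand-in for an RL algorithm that outputs policy certificates."""

    def plan_next_episode(self) -> Tuple[Policy, Certificate]:
        # An optimism-based tabular algorithm would compute upper and lower
        # confidence bounds on the value function and report them here.
        policy: Policy = lambda state: 0
        return policy, Certificate(return_lower_bound=0.0, suboptimality_bound=1.0)

    def update(self, episode_data: List) -> None:
        # Incorporate the observed transitions and rewards into the model.
        pass


def run_with_oversight(agent: CertifiedAgent,
                       run_episode: Callable[[Policy], List],
                       num_episodes: int,
                       min_certified_return: float,
                       fallback: Policy) -> None:
    """Execute the agent, intervening whenever the certified quality is too low."""
    for _ in range(num_episodes):
        policy, cert = agent.plan_next_episode()
        if cert.return_lower_bound < min_certified_return:
            policy = fallback          # certified quality unsatisfactory: intervene
        data = run_episode(policy)     # caller-supplied environment interaction
        agent.update(data)
```

Under this reading, accountability comes from the check on `return_lower_bound` happening before the episode is run, which is what distinguishes certificates from post-hoc policy evaluation.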
