Batch Policy Learning under Constraints

When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in several domains, including a challenging simulated car driving problem subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
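The abstract leaves the two subroutines abstract, so a minimal sketch may help make the interplay concrete. One natural instantiation of such a meta-algorithm is a Lagrangian game: a policy player best-responds with a batch RL subroutine on a multiplier-weighted reward, while a constraint player updates the multipliers with an online-learning rule, using OPE estimates of the constraint values to certify feasibility from the batch data alone. The sketch below is illustrative only, not the paper's exact algorithm; the interfaces `batch_rl`, `ope`, and `thresholds` are hypothetical stand-ins.

```python
import numpy as np

def constrained_meta_algorithm(dataset, batch_rl, ope, thresholds,
                               iterations=100, lr=0.1, lambda_max=10.0):
    """Illustrative Lagrangian game for batch policy learning under
    constraints. `batch_rl(dataset, lambdas)` is any batch RL subroutine
    that best-responds to the scalarized reward r - lambdas . g, where g
    collects the constraint costs; `ope(dataset, policy)` returns
    off-policy estimates of the constraint values of `policy`. Both
    interfaces are assumptions made for this sketch."""
    m = len(thresholds)
    lambdas = np.zeros(m)               # one multiplier per constraint
    policies = []
    for _ in range(iterations):
        # Policy player: batch RL best response to the current multipliers.
        pi = batch_rl(dataset, lambdas)
        policies.append(pi)
        # Certify constraints from the same batch data via OPE,
        # with no additional online rollouts.
        g_hat = np.asarray(ope(dataset, pi))
        # Constraint player: projected online-gradient step that raises
        # multipliers on violated constraints and shrinks them otherwise.
        lambdas = np.clip(lambdas + lr * (g_hat - np.asarray(thresholds)),
                          0.0, lambda_max)
    # A mixture over the iterates is the usual output of such games;
    # returning the full history keeps the sketch agnostic to that choice.
    return policies, lambdas
```

The projected-gradient update on the multipliers is just one valid choice of online-learning subroutine; since the meta-algorithm is stated to admit any such procedure, alternatives like exponentiated gradient would slot into the same loop.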
