Fast Bellman Updates for Robust MDPs

We describe two efficient and exact algorithms for computing Bellman updates in robust Markov decision processes (MDPs). The first algorithm uses a homotopy continuation method to compute updates for L1-constrained s,a-rectangular ambiguity sets. It runs in quasi-linear time for plain L1 norms and also generalizes to weighted L1 norms. The second algorithm uses bisection to compute updates for robust MDPs with s-rectangular ambiguity sets. When combined with the homotopy method, the bisection algorithm also has a quasi-linear runtime. Unlike previous methods, our algorithms compute the primal solution in addition to the optimal objective value, which makes them useful in policy iteration methods. Our experimental results indicate that the proposed methods are over 1,000 times faster than Gurobi, a state-of-the-art commercial optimization package, on small instances, and the performance gap grows considerably with problem size.
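To make the setting concrete, the sketch below (ours, not the paper's homotopy algorithm) solves the inner adversarial problem behind a single s,a-rectangular Bellman update with a plain L1 ambiguity set: minimize p^T z over distributions p in the probability simplex with ||p - p_bar||_1 <= kappa, where p_bar is the nominal transition distribution, z the vector of state values, and kappa the budget. The names worst_case_l1, p_bar, z, and kappa are illustrative. This standard sort-based greedy runs in O(n log n), matching the quasi-linear bound, but unlike the homotopy method it solves the problem for a single budget rather than tracing solutions across all budgets.

```python
import numpy as np

def worst_case_l1(p_bar, z, kappa):
    """Illustrative sketch: worst-case distribution for a plain L1 ambiguity set.

    Solves min_p p @ z s.t. p in simplex, ||p - p_bar||_1 <= kappa, by shifting
    up to kappa/2 probability mass from the most valuable states onto the least
    valuable one. This is not the paper's homotopy method.
    """
    p = np.asarray(p_bar, dtype=float).copy()
    z = np.asarray(z, dtype=float)
    i_min = int(np.argmin(z))               # adversary piles mass on the cheapest state
    eps = min(kappa / 2.0, 1.0 - p[i_min])  # total mass that can be moved
    p[i_min] += eps
    for j in np.argsort(z)[::-1]:           # remove that mass from expensive states first
        if j == i_min:
            continue
        take = min(eps, p[j])
        p[j] -= take
        eps -= take
        if eps <= 1e-12:
            break
    return p, float(p @ z)

# Example: nominal distribution, state values, budget kappa = 0.4
p_bar = np.array([0.3, 0.5, 0.2])
z = np.array([1.0, 2.0, 3.0])
p_star, value = worst_case_l1(p_bar, z, 0.4)  # p_star ~ [0.5, 0.5, 0.0], value = 1.5
```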
