Robust Quadratic Programming for MDPs with uncertain observation noise

Abstract Markov decision processes (MDPs) with uncertain observation noise have rarely been studied. This paper proposes a Robust Quadratic Programming (RQP) approach to approximating the solution of the Bellman equation. Besides being efficient, the proposed algorithm is highly robust to uncertain observation noise, which is essential in real-world applications. We further express the solution in kernel form, which implicitly expands the state-encoded feature space to higher or even infinite dimensions. Experimental results confirm the algorithm's efficiency and robustness, and a comparison across different kernels demonstrates the flexibility of kernel selection for different application scenarios.
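The abstract does not reproduce the RQP formulation itself. As a rough, non-authoritative illustration of the general idea it builds on (a kernel-based value-function representation whose Bellman residuals are fitted by a linear/quadratic solve), the following minimal sketch approximates the value function of a toy chain MDP under a fixed policy; the kernel, MDP, and all names (`k_rbf`, `states`, `gamma`) are illustrative assumptions, not the paper's method.

```python
# Minimal sketch (not the paper's RQP): kernel-based value-function
# approximation by least-squares minimization of Bellman residuals
# on sampled transitions of a small chain MDP. All names and the toy
# MDP below are illustrative assumptions.
import numpy as np

def k_rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; implicitly lifts states to an infinite-dimensional feature space."""
    d = np.atleast_1d(x) - np.atleast_1d(y)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

# Toy 5-state chain MDP under a fixed "move right" policy; reward 1 only in the terminal (absorbing) state.
states = np.arange(5, dtype=float)
next_states = np.minimum(states + 1, 4)
rewards = (states == 4).astype(float)
gamma = 0.9

# Gram matrices between sampled states (and their successors) and the kernel centers.
K = np.array([[k_rbf(s, c) for c in states] for s in states])
K_next = np.array([[k_rbf(sp, c) for c in states] for sp in next_states])

# V(s) = sum_i alpha_i * k(s_i, s). The Bellman residual
# r(s) + gamma * V(s') - V(s) is linear in alpha, so minimizing its
# squared norm reduces to a least-squares (quadratic) problem.
A = K - gamma * K_next
alpha, *_ = np.linalg.lstsq(A, rewards, rcond=None)

V = K @ alpha
print("Approximate state values:", np.round(V, 3))  # close to [6.56, 7.29, 8.1, 9.0, 10.0]
```

Swapping `k_rbf` for another positive-definite kernel changes only the Gram matrices, which is the flexibility of kernel selection the abstract refers to.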
