A Relation Analysis of Markov Decision Process Frameworks

We study the relations between different Markov Decision Process (MDP) frameworks in the machine learning and econometrics literatures, including the standard MDP, the entropy-regularized and general regularized MDP, and the stochastic MDP, where the latter is based on the assumption that the reward function is stochastic and follows a given distribution. We show that the entropy-regularized MDP is equivalent to a stochastic MDP model and is strictly subsumed by the general regularized MDP. Moreover, we propose a distributional stochastic MDP framework by assuming that the distribution of the reward function is ambiguous. We further show that the distributional stochastic MDP is equivalent to the regularized MDP, in the sense that the two always yield the same optimal policies. We also provide a connection between the stochastic/regularized MDP and the constrained MDP. Our work gives a unified view of several important MDP frameworks, which leads to new ways of interpreting the (entropy/general) regularized MDP frameworks through the lens of stochastic rewards, and vice versa. Given the recent popularity of regularized MDPs in (deep) reinforcement learning, our work brings new understanding of how such algorithmic schemes work and suggests ideas for developing new ones.
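As a minimal illustration of the kind of equivalence referred to above, consider a sketch based on the standard Gumbel/log-sum-exp identity from the discrete-choice literature; the notation Q, π, and the unit temperature are expository choices here, not taken from the paper:

\[
\mathbb{E}_{\varepsilon}\!\left[\max_{a}\big(Q(s,a)+\varepsilon_a\big)\right]
= \log\sum_{a}\exp Q(s,a) + \gamma_{\mathrm{EM}}
= \max_{\pi(\cdot\mid s)}\left\{\sum_{a}\pi(a\mid s)\,Q(s,a) + \mathcal{H}\big(\pi(\cdot\mid s)\big)\right\} + \gamma_{\mathrm{EM}},
\]

where the \(\varepsilon_a\) are i.i.d. standard Gumbel-distributed reward perturbations, \(\mathcal{H}\) denotes Shannon entropy, and \(\gamma_{\mathrm{EM}}\) is the Euler–Mascheroni constant. In both formulations the optimal action probabilities are the softmax \(\pi^{*}(a\mid s)\propto \exp Q(s,a)\), which is the sense in which an entropy-regularized MDP and a stochastic-reward (Gumbel) MDP can yield the same optimal policy.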
