The Unintended Consequences of Discount Regularization: Improving Regularization in Certainty Equivalence Reinforcement Learning

Discount regularization, using a shorter planning horizon when computing the optimal policy, is a popular way to restrict planning to a less complex set of policies when an MDP is estimated from sparse or noisy data (Jiang et al., 2015). It is commonly understood that discount regularization functions by de-emphasizing or ignoring delayed effects. In this paper, we reveal an alternate view of discount regularization that exposes unintended consequences. We demonstrate that planning under a lower discount factor produces an optimal policy identical to that of planning with any prior on the transition matrix whose distribution is the same for all states and actions. In effect, it acts like a prior that regularizes state-action pairs with more transition data more strongly. This leads to poor performance when the transition matrix is estimated from data sets with uneven amounts of data across state-action pairs. Our equivalence theorem yields an explicit formula for setting regularization parameters locally, for individual state-action pairs, rather than globally. We demonstrate the failures of discount regularization, and how our state-action-specific method remedies them, on simple empirical examples as well as a medical cancer simulator.
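
To make the equivalence concrete, below is a minimal sketch in Python/NumPy under stated assumptions: the small randomly generated tabular MDP, the helper name greedy_policy, and the pseudo-count expression alpha(s, a) = n(s, a) * (gamma - gamma_reg) / gamma_reg are illustrative reconstructions of the convex-combination view behind the equivalence, not the paper's notation or exact formula. The sketch compares (1) planning with a lowered discount factor on the empirical (MLE) transition matrix against (2) planning with the full discount on a posterior-mean model whose shared prior distribution is uniform and whose strength grows with the transition count n(s, a); the two greedy policies coincide, illustrating the "stronger regularization where there is more data" behavior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP with uneven amounts of data per (s, a) pair.
S, A = 5, 3
R = rng.uniform(0.0, 1.0, size=(S, A))
counts = rng.integers(1, 50, size=(S, A, S)).astype(float)   # observed transition counts
P_mle = counts / counts.sum(axis=2, keepdims=True)            # empirical (MLE) transition model

gamma = 0.95               # evaluation discount
gamma_reg = 0.80           # lower, "regularizing" discount
mu = np.full(S, 1.0 / S)   # any fixed next-state distribution shared by all (s, a); uniform here

def greedy_policy(P, R, gamma, iters=2000):
    """Value iteration under model P with discount gamma; return the greedy policy."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * (P @ V)    # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# (1) Discount regularization: plan with the lower discount on the MLE model.
pi_discount = greedy_policy(P_mle, R, gamma_reg)

# (2) Equivalent-prior view: the same policy comes from planning with the full
#     discount on a posterior-mean model whose pseudo-count alpha(s, a) scales
#     with n(s, a) -- i.e., stronger regularization where there is MORE data.
n_sa = counts.sum(axis=2)                             # n(s, a)
alpha_sa = n_sa * (gamma - gamma_reg) / gamma_reg     # implied prior strength (assumed form)
P_prior = (counts + alpha_sa[..., None] * mu) / (n_sa + alpha_sa)[..., None]
pi_prior = greedy_policy(P_prior, R, gamma)

print("greedy policies identical:", np.array_equal(pi_discount, pi_prior))
```

Read in the other direction, the same algebra suggests the local remedy the abstract refers to: fix the prior pseudo-count per state-action pair and let the effective amount of shrinkage (equivalently, the effective discount) vary with n(s, a), rather than fixing one global discount and letting the implied prior strength vary with the data.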

[1] Kelly W. Zhang, et al. Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-Implementation Guidelines, 2022, Algorithms.

[2] SeungYeon Kang, et al. Reinforcement learning-based expanded personalized diabetes treatment recommendation using South Korean electronic health records, 2022, Expert Syst. Appl.

[3] Christopher Grimm, et al. Proper Value Equivalence, 2021, NeurIPS.

[4] Sharad Goel, et al. Bandit algorithms to personalize educational chatbots, 2021, Machine Learning.

[5] Satinder Singh, et al. The Value Equivalence Principle for Model-Based Reinforcement Learning, 2020, NeurIPS.

[6] Mehrab Singh Gill, et al. VacSIM: Learning effective strategies for COVID-19 vaccine distribution using reinforcement learning, 2020, Intelligence-Based Medicine.

[7] Ron Meir, et al. Discount Factor as a Regularizer in Reinforcement Learning, 2020, ICML.

[8] Yao Liu, et al. Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions, 2020, ICML.

[9] Ian Osband, et al. Making Sense of Reinforcement Learning and Probabilistic Inference, 2020, ICLR.

[10] Kristjan H. Greenewald, et al. Personalized HeartSteps, 2019, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.

[11] Silviu Pitis, et al. Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach, 2019, AAAI.

[12] Christopher Grimm, et al. Mitigating Planner Overfitting in Model-Based Reinforcement Learning, 2018, ArXiv.

[13] Joelle Pineau, et al. Contextual Bandits for Adapting Treatment in a Mouse Model of de Novo Carcinogenesis, 2018, MLHC.

[14] Emma Brunskill, et al. Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs, 2018, ICML.

[15] Martha White, et al. Unifying Task Specification in Reinforcement Learning, 2016, ICML.

[16] Shie Mannor, et al. Bayesian Reinforcement Learning: A Survey, 2015, Found. Trends Mach. Learn.

[17] Nan Jiang, et al. The Dependence of Effective Planning Horizon on Model Accuracy, 2015, AAMAS.

[18] Naoto Yoshida, et al. Reinforcement learning with state-dependent discount factor, 2013, IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL).

[19] Benjamin Van Roy, et al. (More) Efficient Reinforcement Learning via Posterior Sampling, 2013, NIPS.

[20] Kevin P. Murphy, et al. Machine learning - a probabilistic perspective, 2012, Adaptive computation and machine learning series.

[21] Johan Pallud, et al. A Tumor Growth Inhibition Model for Low-Grade Glioma Treated with Chemotherapy or Radiotherapy, 2012, Clinical Cancer Research.

[22] Xianping Guo, et al. Markov decision processes with state-dependent discount factors and unbounded rewards/costs, 2011, Oper. Res. Lett.

[23] Joelle Pineau, et al. A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes, 2011, J. Mach. Learn. Res.

[24] Richard L. Lewis, et al. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning, 2010, UAI.

[25] Lihong Li, et al. A Bayesian Sampling Approach to Exploration in Reinforcement Learning, 2009, UAI.

[26] Andrew Y. Ng, et al. Near-Bayesian exploration in polynomial time, 2009, ICML.

[27] Joelle Pineau, et al. Bayes-Adaptive POMDPs, 2007, NIPS.

[28] Jesse Hoey, et al. An analytic solution to discrete Bayesian reinforcement learning, 2006, ICML.

[29] Malcolm J. A. Strens, et al. A Bayesian Framework for Reinforcement Learning, 2000, ICML.

[30] Andrew Y. Ng, et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping, 1999, ICML.

[31] F. Girosi, et al. Networks for approximation and learning, 1990, Proc. IEEE.

[32] André Barreto, et al. Approximate Value Equivalence, 2022, NeurIPS.

[33] S. Kakade, et al. Reinforcement Learning: Theory and Algorithms, 2019.

[34] Maosong Sun, et al. Bandit Learning with Implicit Feedback, 2018, NeurIPS.

[35] Shie Mannor, et al. Bayesian Reinforcement Learning, 2010, Encyclopedia of Machine Learning.

[36] Andrew G. Barto, et al. Optimal learning: computational procedures for bayes-adaptive markov decision processes, 2002.

[37] Peter Stone, et al. Scaling Reinforcement Learning toward RoboCup Soccer, 2001, ICML.