Biased Aggregation, Rollout, and Enhanced Policy Improvement for Reinforcement Learning

We propose a new aggregation framework for approximate dynamic programming, which provides a connection with rollout algorithms, approximate policy iteration, and other single-step and multistep lookahead methods. The central novel characteristic is the use of a bias function $V$ of the state, which biases the values of the aggregate cost function towards their correct levels. The classical aggregation framework is obtained when $V\equiv0$, but our scheme works best when $V$ is a known, reasonably good approximation to the optimal cost function $J^*$. When $V$ is equal to the cost function $J_{\mu}$ of some known policy $\mu$ and there is only one aggregate state, our scheme is equivalent to the rollout algorithm based on $\mu$ (i.e., the result of a single policy improvement starting with the policy $\mu$). When $V=J_{\mu}$ and there are multiple aggregate states, our aggregation approach can be used as a more powerful form of improvement of $\mu$. Thus, when combined with an approximate policy evaluation scheme, our approach can form the basis for a new and enhanced form of approximate policy iteration. When $V$ is a generic bias function, our scheme is equivalent to approximation in value space with lookahead function equal to $V$ plus a local correction within each aggregate state. The local correction levels are obtained by solving a low-dimensional aggregate DP problem, yielding an arbitrarily close approximation to $J^*$ when the number of aggregate states is sufficiently large. Except for the bias function, the aggregate DP problem is similar to the one of the classical aggregation framework, and its algorithmic solution by simulation or other methods is nearly identical to the one for classical aggregation, assuming values of $V$ are available when needed.
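
To make the scheme concrete, consider hard aggregation for a discounted problem, where each original state $i$ belongs to a single aggregate state $x(i)$ and the approximation has the form $\tilde J(i) = V(i) + r_{x(i)}$. One natural reading of the low-dimensional aggregate DP problem is the fixed point equation $r_x = \sum_i d_{xi}\big((T(V+\Phi r))(i) - V(i)\big)$, where $T$ is the Bellman operator, $d_{xi}$ are disaggregation probabilities, and $\Phi r$ denotes the piecewise constant extension of $r$; with $V\equiv0$ this reduces to classical hard aggregation. The sketch below solves this equation by fixed point iteration. It is illustrative only: the array layout and the names `p`, `g`, `agg_of`, `d`, `V`, and `alpha` are assumptions made for the example, not notation from the paper.

```python
import numpy as np

def biased_aggregation_corrections(p, g, agg_of, d, V, alpha=0.95,
                                   iters=10_000, tol=1e-10):
    """Illustrative sketch of biased (hard) aggregation.

    Computes correction levels r so that V[i] + r[agg_of[i]] approximates J*(i).

    p[u][i, j] : transition probabilities under control u
    g[u][i, j] : stage costs under control u
    agg_of[i]  : index of the aggregate state containing original state i
    d[x, i]    : disaggregation probabilities of aggregate state x
    V[i]       : bias function values
    alpha      : discount factor
    """
    n_agg = d.shape[0]
    r = np.zeros(n_agg)
    for _ in range(iters):
        # Current biased approximation: (V + Phi r)(i) = V(i) + r_{x(i)}
        J_tilde = V + r[agg_of]
        # Q[u, i] = sum_j p_{ij}(u) * (g(i, u, j) + alpha * J_tilde(j))
        Q = np.stack([(p_u * (g_u + alpha * J_tilde)).sum(axis=1)
                      for p_u, g_u in zip(p, g)])
        TJ = Q.min(axis=0)            # Bellman operator applied to V + Phi r
        # Aggregate fixed point step: r_x = sum_i d_{x,i} (TJ(i) - V(i))
        r_new = d @ (TJ - V)
        if np.max(np.abs(r_new - r)) < tol:
            return r_new
        r = r_new
    return r
```

With a single aggregate state and $V=J_{\mu}$, the correction $r$ shifts $J_{\mu}$ by a constant, so one-step lookahead using $\tilde J$ selects the same controls as lookahead using $J_{\mu}$ itself, which is consistent with the equivalence to rollout noted above.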
