Direct Policy Search and Uncertain Policy Evaluation

Reinforcement learning based on direct search in policy space requires few assumptions about the environment. Hence it is applicable in certain situations where most traditional reinforcement learning algorithms are not, especially in partially observable, deterministic worlds. In realistic settings, however, reliable policy evaluations are complicated by numerous sources of uncertainty, such as stochasticity in policy and environment. Given a limited lifetime, how much time should a direct policy searcher spend on policy evaluations to obtain reliable statistics? Our efficient approach based on the success-story algorithm (SSA) is radical in the sense that it never stops evaluating any previous policy modification, except those it undoes for lack of empirical evidence that they have contributed to lifelong reward accelerations. While previous experimental research has already demonstrated SSA's applicability to large-scale partially observable environments, a study of why it performs well has been lacking. Here we identify for the first time SSA's fundamental advantages over traditional direct policy search (such as stochastic hill-climbing) on problems involving several sources of stochasticity and uncertainty.
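To make the success-story mechanism concrete, the following is a minimal Python sketch of an SSA-style direct policy searcher. The class and method names (SSAPolicySearch, modify, ssa_checkpoint) and the tabular policy representation are assumptions introduced for illustration; the code shows the stack of still-valid policy modifications and the undoing of those that fail the success-story criterion (reward per time since each surviving modification must increase with recency), not the authors' exact implementation.

```python
class SSAPolicySearch:
    """Sketch of SSA-style direct policy search (hypothetical names).

    A tabular policy is modified over a single lifetime. Each
    modification is pushed onto a stack together with the time and
    cumulative reward at which it was made, so it can be undone later.
    """

    def __init__(self, policy):
        self.policy = policy   # e.g. dict mapping state -> action parameters
        self.stack = []        # entries: (time, cumulative_reward, state, previous_value)
        self.t = 0             # lifetime step counter
        self.R = 0.0           # cumulative reward over the whole lifetime

    def observe_reward(self, r):
        """Record one time step of lifelong experience."""
        self.t += 1
        self.R += r

    def modify(self, state, new_value):
        """Apply a (possibly stochastic) policy modification, remembering
        enough information to undo it at a later SSA checkpoint."""
        self.stack.append((self.t, self.R, state, self.policy.get(state)))
        self.policy[state] = new_value

    def _success_story_holds(self):
        """Success-story criterion: reward per time since each surviving
        modification must strictly increase with recency; the start of
        life (time 0, reward 0) serves as the baseline."""
        prev_rate = float("-inf")
        for t0, R0 in [(0, 0.0)] + [(t, R) for (t, R, _, _) in self.stack]:
            if self.t <= t0:
                return False      # no evidence yet for this modification
            rate = (self.R - R0) / (self.t - t0)
            if rate <= prev_rate:
                return False
            prev_rate = rate
        return True

    def ssa_checkpoint(self):
        """Undo the most recent modifications until the success-story
        criterion holds. Surviving modifications are never accepted once
        and for all; they are re-evaluated at every later checkpoint."""
        while self.stack and not self._success_story_holds():
            _, _, state, previous_value = self.stack.pop()
            if previous_value is None:
                self.policy.pop(state, None)
            else:
                self.policy[state] = previous_value
```

A searcher built on this skeleton would interleave acting (observe_reward), occasional policy modifications (modify), and SSA calls (ssa_checkpoint). Unlike a stochastic hill-climber, which accepts or rejects a change after a fixed evaluation window, every surviving modification here remains under lifelong evaluation, which is the "never stops evaluating" property described in the abstract.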
