Reinforcement learning in the presence of rare events

We consider the task of reinforcement learning in an environment in which rare significant events occur independently of the actions selected by the controlling agent. If these events are sampled according to their natural probability of occurring, conventional reinforcement learning algorithms are likely to converge slowly and may exhibit high variance. In this work, we assume access to a simulator in which the rare-event probabilities can be artificially altered; importance sampling can then be used to learn from this simulation data. We introduce algorithms for policy evaluation, using both tabular and function approximation representations of the value function, and prove that in both cases the algorithms converge. In the tabular case, we also analyze the bias and variance of our approach compared to TD-learning. We empirically evaluate the performance of the algorithm on random Markov Decision Processes, as well as on a large network planning task.
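To illustrate the general idea behind the tabular case, the following is a minimal sketch (not the paper's algorithm; the chain, the function name td0_rare_event, and all parameter values such as p_true and p_sim are assumptions made for illustration) of importance-weighted TD(0) policy evaluation. The simulator samples a rare failure with an inflated probability p_sim instead of the true probability p_true, and each update is reweighted by the likelihood ratio of the sampled transition.

```python
import numpy as np

# Hypothetical sketch: importance-weighted tabular TD(0) policy evaluation.
# Two states: state 0 is the normal operating state; state 1 is an absorbing
# failure state reached via a rare transition. The simulator inflates the
# rare-event probability from p_true to p_sim, and the likelihood ratio of
# each sampled transition corrects the TD update.

def td0_rare_event(p_true=0.001, p_sim=0.1, reward_rare=-100.0,
                   gamma=0.95, alpha=0.01, episodes=20000, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(2)  # V[0]: normal state, V[1]: absorbing failure state
    for _ in range(episodes):
        s = 0
        # One-step episode for illustration: either the rare failure
        # occurs, or a benign self-transition with zero reward.
        if rng.random() < p_sim:
            s_next, r, w = 1, reward_rare, p_true / p_sim
        else:
            s_next, r, w = 0, 0.0, (1 - p_true) / (1 - p_sim)
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * w * td_error  # importance-weighted TD(0) update
    return V

if __name__ == "__main__":
    print(td0_rare_event())
```

Under these assumptions the weighted update has the same fixed point as TD(0) under the natural probabilities, so V[0] should approach roughly p_true * reward_rare / (1 - gamma * (1 - p_true)) ≈ -1.96, while the rare transition is observed about a hundred times more often in simulation than it would be at its natural frequency.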
