A reinforcement learning approach to rare trajectory sampling

Very often when studying non-equilibrium systems one is interested in analysing dynamical behaviour that occurs with very low probability, so-called rare events. In practice, since rare events are by definition atypical, they are difficult to access in a statistically significant way. What is needed are strategies to "make rare events typical", so that they can be generated on demand. Here we present such a general approach for adaptively constructing a dynamics that efficiently samples atypical events. We do so by exploiting the methods of reinforcement learning (RL), the set of machine learning techniques aimed at finding the optimal behaviour that maximises a reward associated with the dynamics. We take the general perspective of dynamical trajectory ensembles, in which rare events are described in terms of ensemble reweighting. By minimising the distance between a reweighted ensemble and that of a suitably parametrised controlled dynamics, we arrive at a set of methods similar to those of RL for numerically approximating the optimal dynamics that realises the rare behaviour of interest. As simple illustrations we consider in detail the problem of excursions of a random walker, for the case of rare events with a finite time horizon, and the problem of the current statistics of a particle hopping on a ring, for the case of an infinite time horizon. We discuss natural extensions of the ideas presented here, including to continuous-time Markov systems, first-passage-time problems and non-Markovian dynamics.
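To give a concrete flavour of the kind of construction described above, the following minimal Python sketch is our own illustration (not the paper's exact algorithm): it learns a biased hop probability for a discrete-time random walker on a ring so that trajectories carrying an atypically large rightward current become typical. The per-step reward is the standard tilted form s*j + log p0(j) − log p_theta(j), whose long-time average is maximised by the optimal (Doob) controlled dynamics; the single-logit parametrisation, the tilt s, the unbiased hop probability p0 and the learning rates are all assumptions made for illustration.

```python
import numpy as np

# Minimal sketch: REINFORCE-style learning of a controlled hop probability
# that makes an atypical rightward current of a ring random walker typical.
rng = np.random.default_rng(0)

s = 1.0          # tilt conjugate to the current; s > 0 favours rightward hops
p0 = 0.5         # original (unbiased) probability of hopping right
theta = 0.0      # logit parametrising the controlled hop-right probability
alpha = 0.01     # learning rate for the policy parameter
baseline = 0.0   # running estimate of the average reward (variance reduction)

for step in range(200_000):
    p_right = 1.0 / (1.0 + np.exp(-theta))        # controlled probability of a +1 hop
    jump = 1 if rng.random() < p_right else -1    # sample a hop under the controlled dynamics
    p_act = p_right if jump == 1 else 1.0 - p_right
    p_orig = p0 if jump == 1 else 1.0 - p0

    # tilted per-step reward: current bias plus log-ratio of original to controlled dynamics
    reward = s * jump + np.log(p_orig) - np.log(p_act)

    # gradient of log p_theta(jump) with respect to the logit theta
    grad_logp = (1.0 - p_right) if jump == 1 else -p_right

    baseline += 0.001 * (reward - baseline)       # slow running average as a baseline
    theta += alpha * (reward - baseline) * grad_logp

# For s = 1 and p0 = 0.5 the exact Doob-transformed value is e/(e + 1/e) ~ 0.88,
# so the learned probability should settle near that number.
print("learned hop-right probability:", 1.0 / (1.0 + np.exp(-theta)))
```

In this single-state example the update is just stochastic gradient ascent on the average tilted reward, which is the same variational objective (distance between the reweighted and controlled trajectory ensembles) that the text describes in the general case.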
