Learning curve bounds for a Markov decision process with undiscounted rewards

The goal of learning in Markov decision processes is to find a policy that yields the maximum expected return over time. In problems with large state spaces, computing these averages directly is not feasible; instead, the agent must estimate them by stochastic exploration of the state space. Using methods from statistical mechanics, we study how the agent’s performance depends on the allowed exploration time. In particular, for a simple control problem with undiscounted rewards, we compute a lower bound on the return of policies that appear optimal based on imperfect statistics. This is done in the thermodynamic limit: $T \to \infty$, $N \to \infty$, $\alpha = T/N$ (finite), where $T$ is the number of time steps allotted per policy evaluation and $N$ is the size of the state space.
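
To make the role of the exploration time concrete, the following is a minimal sketch of estimating a policy's undiscounted average reward from a single sampled trajectory of T steps. It is an illustration only, not the analysis or the control problem of the paper. It assumes the policy is held fixed, so the process reduces to a Markov chain over N states, and the names `estimate_average_reward` and `random_chain` are hypothetical.

```python
import random


def estimate_average_reward(transitions, rewards, T, start=0, seed=0):
    """Empirical average reward over T steps of a fixed-policy Markov chain.

    transitions[s] is a list of next-state probabilities for state s;
    rewards[s] is the reward collected whenever the chain is in state s.
    """
    rng = random.Random(seed)
    states = range(len(rewards))
    state, total = start, 0.0
    for _ in range(T):
        total += rewards[state]  # collect the reward of the current state
        state = rng.choices(states, weights=transitions[state])[0]
    return total / T


def random_chain(N, seed=0):
    """Hypothetical test problem: random transition rows, rewards in [0, 1]."""
    rng = random.Random(seed)
    transitions = []
    for _ in range(N):
        row = [rng.random() for _ in range(N)]
        norm = sum(row)
        transitions.append([p / norm for p in row])
    rewards = [rng.random() for _ in range(N)]
    return transitions, rewards


if __name__ == "__main__":
    N = 50
    transitions, rewards = random_chain(N)
    # Vary alpha = T / N to see how the estimate changes with exploration time.
    for alpha in (0.5, 2.0, 8.0):
        T = int(alpha * N)
        print(alpha, estimate_average_reward(transitions, rewards, T))
```

With small alpha the trajectory visits only a fraction of the states, so the estimate is noisy; comparing policies on such estimates is the situation in which a policy can appear optimal based on imperfect statistics.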
