On Discontinuous Q-Functions in Reinforcement Learning

This paper considers the application of reinforcement learning to path-finding tasks in continuous state space in the presence of obstacles. We show that cumulative evaluation functions (such as Q-Functions [28] and V-Functions [4]) may be discontinuous if forbidden regions (such as those induced by obstacles) exist in state space. Since the infinite number of states requires the use of function approximators such as backpropagation networks [16, 12, 24], we argue that these discontinuities cause severe difficulties in learning cumulative evaluation functions. The discontinuities we detected might also explain why recent applications of reinforcement learning systems to complex tasks [12] failed to show the desired performance. In our conclusion, we outline some ideas for circumventing the problem.
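A minimal worked example of the kind of discontinuity described above (the goal reward of 1, the discount factor γ, and the shortest-path length d*(s) are illustrative assumptions introduced here, not quantities taken from the paper): consider a goal-reaching task that pays reward 1 on arrival at the goal, reward 0 elsewhere, and discounts future reward by γ ∈ (0, 1). The optimal value function is then

    V*(s) = γ^d*(s),

where d*(s) is the number of steps on the shortest obstacle-free path from s to the goal. If a thin obstacle separates two states s₁ and s₂, the required detour keeps the difference between d*(s₁) and d*(s₂) bounded away from zero even as the distance between s₁ and s₂ shrinks to zero, so V* (and likewise Q*) jumps across the obstacle boundary. A smooth function approximator such as a backpropagation network can reproduce such a jump only with large local error or an extremely steep transition.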

[1] Charles W. Anderson. Learning and problem-solving with multilayer connectionist systems (adaptive, strategy learning, neural networks, reinforcement learning), 1986.

[2] Marvin Minsky. Steps toward Artificial Intelligence, Proceedings of the IRE, 1961.

[3] Patchigolla Kiran Kumar et al. A Survey of Some Results in Stochastic Adaptive Control, 1985.

[4] A. Barto et al. Learning and Sequential Decision Making, 1989.

[5] P. J. Werbos et al. Backpropagation and neurocontrol: a review and prospectus, International Joint Conference on Neural Networks, 1989.

[6] Paul J. Werbos et al. Consistency of HDP applied to a simple reinforcement learning problem, Neural Networks, 1990.

[7] Arthur L. Samuel. Some Studies in Machine Learning Using the Game of Checkers, IBM J. Res. Dev., 1967.

[8] Chris Watkins. Learning from delayed rewards, 1989.

[9] Geoffrey E. Hinton et al. Learning internal representations by error propagation, 1986.

[10] Sebastian Thrun et al. Efficient Exploration in Reinforcement Learning, 1992.

[11] F. J. Śmieja et al. Multiple Network Systems (Minos) Modules: Task Division and Module Discrimination, 1991.

[12] Richard S. Sutton et al. Learning and Sequential Decision Making, 1989.

[13] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, ML, 1990.

[14] A. G. Barto. Simulation Experiments with Goal-Seeking Adaptive Elements, 1984.

[15] G. Tesauro. Practical Issues in Temporal Difference Learning, 1992.

[16] Richard S. Sutton. Temporal credit assignment in reinforcement learning, 1984.

[17] Geoffrey E. Hinton. Connectionist Learning Procedures, Artif. Intell., 1989.

[18] Dana H. Ballard et al. Active Perception and Reinforcement Learning, Neural Computation, 1990.

[19] Paul J. Werbos et al. An Empirical Test of New Forecasting Methods Derived from a Theory of Intelligence: The Prediction of Conflict in Latin America, IEEE Transactions on Systems, Man, and Cybernetics, 1978.

[20] P. Dayan. The Convergence of TD(λ) for General λ, Machine Learning, 1992.

[21] Dieter Fox et al. Learning By Error-Driven Decomposition, 1991.

[22] A. L. Samuel. Some studies in machine learning using the game of checkers. II: recent progress, 1967.

[23] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers, IBM J. Res. Dev., 1967.

[24] Gerald Tesauro. Practical Issues in Temporal Difference Learning, Mach. Learn., 1992.

[25] Richard E. Korf et al. Real-time heuristic search: new results, AAAI, 1988.

[26] Richard S. Sutton et al. Reinforcement Learning is Direct Adaptive Optimal Control, 1991 American Control Conference, 1992.