Scaling Up Average Reward Reinforcement Learning by Approximating the Domain Models and the Value Function

Almost all of the work in Average-reward Reinforcement Learning (ARL) so far has focused on table-based methods, which do not scale to domains with large state spaces. In this paper, we propose two extensions to a model-based ARL method called H-learning to address the scale-up problem. We extend H-learning to learn action models and reward functions in the form of Bayesian networks, and approximate its value function using local linear regression. We test our algorithms on several scheduling tasks for a simulated Automatic Guided Vehicle (AGV) and show that they are effective in significantly reducing the space requirement of H-learning and in making it converge faster. To the best of our knowledge, our results are the first in applying function approximation to ARL.
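For context, H-learning is organized around the standard average-reward (undiscounted) Bellman optimality equation; a common statement, using assumed notation with $h$ the bias (relative value) function, $\rho$ the average reward per step, and $r(i,a)$ and $p(j \mid i,a)$ the reward and action models, is

\[ h(i) = \max_{a} \Big[ r(i,a) - \rho + \sum_{j} p(j \mid i,a)\, h(j) \Big]. \]

H-learning estimates $r$, $p$, and $\rho$ from experience and updates $h$ toward the right-hand side; the extensions described in the abstract replace the tabular $r$ and $p$ with Bayesian network models and the tabular $h$ with a local linear regression approximation.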
