Focus of Attention in Reinforcement Learning

Classification-based reinforcement learning (RL) methods have recently been proposed as an alternative to traditional value-function-based methods. These methods use a classifier to represent a policy: the input (feature vector) to the classifier is a state, and the output (class label) for that state is the desired action. It has long been recognized in the RL community that focusing learning on the more important states can improve performance. In this paper, we investigate focused learning in the context of classification-based RL. Specifically, we define a useful notion of state importance, which we use to prove rigorous bounds on policy loss. Furthermore, we show that a classification-based RL agent may behave arbitrarily poorly if it treats all states as equally important.
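As a rough illustration of the setup the abstract describes (not the paper's algorithm), the sketch below represents a policy as a classifier over state features and weights each training state by an assumed importance score when fitting it. The state features, action labels, importance values, and the use of scikit-learn's DecisionTreeClassifier are all illustrative assumptions.

```python
# Illustrative sketch: classification-based RL policy with per-state
# importance weights. All data and the choice of classifier are
# hypothetical; any classifier supporting sample weights would do.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: state feature vectors, the desired action
# for each state (the class label), and an importance score per state
# (how costly a wrong action is in that state).
states = np.array([[0.1, 0.9],
                   [0.4, 0.2],
                   [0.8, 0.7],
                   [0.3, 0.5]])
actions = np.array([0, 1, 1, 0])             # desired action = class label
importance = np.array([5.0, 0.5, 3.0, 1.0])  # assumed per-state importance

# Treating all states as equally important corresponds to uniform weights;
# here the classifier is instead pushed to make its errors on the
# low-importance states.
policy = DecisionTreeClassifier(max_depth=3)
policy.fit(states, actions, sample_weight=importance)

def act(state_features):
    """Policy: map a state's feature vector to an action (class label)."""
    return int(policy.predict(np.asarray(state_features).reshape(1, -1))[0])

print(act([0.2, 0.8]))  # action chosen for a previously unseen state
```

The only design point the sketch is meant to convey is that the classification loss is weighted by state importance rather than uniform, which is the idea the abstract's policy-loss bounds are about.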
