Relating reinforcement learning performance to classification performance

We prove a quantitative connection between the expected sum of rewards of a policy and binary classification performance on created subproblems. This connection holds without any unobservable assumptions (no assumption of independence, small mixing time, fully observable states, or even hidden states), and the resulting statement is independent of the number of states or actions. The statement depends critically on the size of the rewards and on the prediction performance of the created classifiers. We also provide general guidelines for obtaining good classification performance on the created subproblems; in particular, we discuss possible methods for generating training examples for a classifier learning algorithm.
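As a rough illustrative sketch only (not the construction analyzed in the paper), the following Python code shows one way a T-step reinforcement learning problem can be reduced to a sequence of classification subproblems, in the spirit of policy search by dynamic programming: training examples for step t are generated by labeling sampled states with the action whose Monte Carlo reward-to-go, under the classifiers already learned for later steps, is highest. The toy environment (step, sample_state), the rollout estimator, and all constants are hypothetical stand-ins.

# Illustrative sketch only: reduce a T-step RL problem to per-step
# classification by labeling sampled states with the action whose estimated
# reward-to-go is highest.  The environment, the state-sampling routine, and
# the constants below are hypothetical; a real method would draw step-t
# states from trajectories of an exploration or base policy rather than from
# the start-state distribution.
import random
from sklearn.tree import DecisionTreeClassifier

T = 5                 # horizon of the T-step problem
ACTIONS = [0, 1]      # binary action space, so each subproblem is binary

def step(state, action):
    """Toy deterministic chain: action 1 moves right and yields reward 1."""
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if action == 1 else 0.0
    return next_state, reward

def sample_state():
    """Hypothetical stand-in for the distribution of states at a given step."""
    return random.randint(-3, 3)

def rollout_return(state, t, first_action, classifiers):
    """Reward-to-go of taking first_action at step t, then following the
    classifiers already learned for steps t+1, ..., T-1."""
    total, action = 0.0, first_action
    for u in range(t, T):
        state, reward = step(state, action)
        total += reward
        if u + 1 < T:
            action = classifiers[u + 1].predict([[state]])[0]
    return total

# Learn one classifier per step, backwards from the end of the horizon.
classifiers = [None] * T
for t in reversed(range(T)):
    X, y = [], []
    for _ in range(200):  # number of generated training examples per step
        s = sample_state()
        # Label the state with the action that looks best under rollouts.
        best = max(ACTIONS, key=lambda a: rollout_return(s, t, a, classifiers))
        X.append([s])
        y.append(best)
    classifiers[t] = DecisionTreeClassifier().fit(X, y)

# The resulting nonstationary policy: at step t, act as classifier t predicts.
def policy(state, t):
    return classifiers[t].predict([[state]])[0]

print([int(policy(0, t)) for t in range(T)])

The classifiers are learned backwards in time so that, when examples for step t are being labeled, the behavior at later steps is already fixed; this is one of several possible schemes for generating training examples, and any off-the-shelf classifier learner could replace the decision tree used here.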
