A Generalization Error for Q-Learning

Planning problems in which a policy must be learned from a single training set of finite-horizon trajectories arise in both the social sciences and medicine. We consider Q-learning with function approximation in this setting and derive an upper bound on the generalization error. The bound is expressed in terms of quantities minimized by the Q-learning algorithm, the complexity of the approximation space, and an approximation term arising from the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.
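The setting is batch Q-learning over a fixed set of finite-horizon trajectories. The paper defines the algorithm formally; the sketch below is only a minimal, illustrative version, assuming linear function approximation fit by least squares, with hypothetical names (`fit_q_functions`, the feature map `phi`) chosen for this example. Working backward from the horizon, a Q-function at each decision point t is fit by regressing the one-step targets r_t + max_a Q_{t+1}(s_{t+1}, a) on features of (s_t, a_t); the resulting empirical squared errors are of the kind the bound refers to as quantities minimized by the algorithm.

```python
import numpy as np

def fit_q_functions(trajectories, phi, n_actions, T):
    """Batch Q-learning from a single set of finite-horizon trajectories.

    A minimal sketch, not the paper's exact procedure. At each time t,
    working backward from the horizon, fit a linear Q-function by least
    squares to the one-step targets r_t + max_a Q_{t+1}(s_{t+1}, a).

    trajectories: list of [(s_0, a_0, r_0), ..., (s_{T-1}, a_{T-1}, r_{T-1})]
    phi: feature map phi(s, a) -> 1-D np.ndarray of fixed dimension
    """
    weights = [None] * T  # one weight vector per decision point
    for t in reversed(range(T)):
        X, y = [], []
        for traj in trajectories:
            s, a, r = traj[t]
            target = r
            if t + 1 < T:
                s_next = traj[t + 1][0]
                # Bootstrap with the already-fitted Q-function at t+1.
                target += max(
                    phi(s_next, a2) @ weights[t + 1] for a2 in range(n_actions)
                )
            X.append(phi(s, a))
            y.append(target)
        X, y = np.asarray(X), np.asarray(y)
        # Least-squares fit: minimizes the empirical squared error to the
        # one-step Bellman targets at decision point t.
        weights[t], *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def greedy_action(weights, phi, n_actions, t, s):
    """Greedy action at time t under the fitted Q-functions."""
    return max(range(n_actions), key=lambda a: phi(s, a) @ weights[t])
```

The fitted weights induce the greedy policy above; the generalization bound relates the value of such a policy to the optimal value, up to an estimation term governed by the complexity of the approximation space and the approximation term described in the abstract.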
