Sample Complexity Bounds of Exploration

Efficient exploration is widely recognized as a fundamental challenge in reinforcement learning: algorithms that explore efficiently converge faster to near-optimal policies. While heuristic techniques are popular in practice, they lack formal guarantees and may not work well in general. This chapter studies algorithms with polynomial sample complexity of exploration, both model-based and model-free, in a unified manner. These so-called PAC-MDP algorithms behave near-optimally except in a "small" number of steps, with high probability. A new learning model known as KWIK (Knows What It Knows) is used to unify most existing model-based PAC-MDP algorithms for various subclasses of Markov decision processes. We also compare the sample-complexity framework to alternative formalizations of exploration efficiency, such as regret minimization and Bayes-optimal solutions.
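To make the KWIK and PAC-MDP ideas concrete, below is a minimal sketch (in Python, not taken from the chapter) of the "known state-action" mechanism that underlies R-MAX and related model-based PAC-MDP algorithms. The class name, the threshold m, and the dictionary-based bookkeeping are illustrative assumptions; the essential point is only that the learner answers "don't know" until it has enough data, and the number of such answers is what the sample-complexity bounds control.

```python
from collections import defaultdict

class KWIKTransitionLearner:
    """KWIK-style transition estimator: answers "don't know" (None) for a
    state-action pair until it has seen at least m samples of that pair,
    then returns the empirical next-state distribution."""

    def __init__(self, m):
        self.m = m                                           # samples required before a pair counts as "known"
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.totals = defaultdict(int)                       # (s, a) -> total samples observed

    def observe(self, s, a, s_next):
        # Record one observed transition (s, a) -> s_next.
        self.counts[(s, a)][s_next] += 1
        self.totals[(s, a)] += 1

    def predict(self, s, a):
        # KWIK protocol: return either an accurate answer or an explicit "don't know".
        n = self.totals[(s, a)]
        if n < self.m:
            return None                                      # "don't know": the pair is still unknown
        return {sn: c / n for sn, c in self.counts[(s, a)].items()}
```

An R-MAX-style agent plans in a model in which every "don't know" pair leads to a fictitious state of maximal reward, so the greedy policy is drawn toward unexplored pairs; the total number of "don't know" answers (here at most m per state-action pair) is what the polynomial sample-complexity bounds of exploration quantify.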
