Efficient model-based exploration in continuous state-space environments

The purpose of exploration in reinforcement learning (RL) is to reduce uncertainty about the environment so that better decisions can be made. As such, exploration plays a crucial role in the efficiency of RL algorithms. In this dissertation, I consider continuous-state control problems and introduce a new methodology for representing uncertainty that leads to more efficient algorithms. I argue that this new notion of uncertainty allows for more effective use of function approximation, which is essential for learning in continuous spaces. In particular, I focus on the class of algorithms known as model-based methods and develop several such algorithms that are substantially more efficient than current state-of-the-art methods. These algorithms attack the long-standing “curse of dimensionality”: the tendency of learning complexity to scale exponentially with problem dimensionality. I introduce algorithms that exploit the dependency structure between state variables to reduce the sample complexity of learning exponentially, both when the dependency structure is provided by the user a priori and when the algorithm must discover it on its own. I also use the new notion of uncertainty to derive a multi-resolution exploration scheme and demonstrate how this technique achieves anytime behavior, which is important in real-life applications. Finally, through a set of rich experiments, I show how the new exploration mechanisms affect the efficiency of learning, especially in real-life domains where acquiring samples is expensive.
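To make the flavor of the approach concrete, the following is a minimal, hypothetical sketch (in Python) of knownness-based optimism in a continuous state space. The names `KnownnessModel`, `radius`, `n_known`, and `v_max` are illustrative assumptions, not the dissertation's actual algorithm or code; the sketch only shows the general idea that sparsely visited regions are treated optimistically, which drives the agent to explore them.

```python
import numpy as np


class KnownnessModel:
    """Hypothetical sketch of knownness-based exploration in a continuous
    state space (illustrative only; not the dissertation's exact method).

    A state is considered "known" in proportion to how many previously
    visited states fall within a resolution-dependent radius. Poorly
    known regions receive an optimistic value, in the spirit of R-max.
    """

    def __init__(self, radius=0.1, n_known=5, v_max=1.0):
        self.radius = radius      # resolution of the knownness estimate (assumed)
        self.n_known = n_known    # visits needed before a region counts as known
        self.v_max = v_max        # optimistic value assigned to unknown regions
        self.visited = []         # visited state vectors

    def update(self, state):
        """Record a visited state."""
        self.visited.append(np.asarray(state, dtype=float))

    def knownness(self, state):
        """Return a fraction in [0, 1]: how well-sampled the neighborhood is."""
        if not self.visited:
            return 0.0
        pts = np.stack(self.visited)
        dists = np.linalg.norm(pts - np.asarray(state, dtype=float), axis=1)
        n_close = int(np.sum(dists <= self.radius))
        return min(1.0, n_close / self.n_known)

    def optimistic_value(self, state, predicted_value):
        """Blend the learned model's value estimate with the optimistic bonus."""
        k = self.knownness(state)
        return k * predicted_value + (1.0 - k) * self.v_max


# Tiny usage example on a 2-D state space.
model = KnownnessModel(radius=0.2, n_known=3, v_max=1.0)
for s in np.random.rand(20, 2):            # pretend trajectory of visited states
    model.update(s)
print(model.knownness([0.5, 0.5]))          # higher where samples cluster
print(model.optimistic_value([0.5, 0.5], 0.3))
```

In a full model-based learner, such an optimistic value would be used inside the planner, so that poorly sampled regions remain attractive until they have been visited often enough at the current resolution.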
