Making Reinforcement Learning Work on Real Robots

Programming robots is hard. It often takes a great deal of time to fine-tune the many parameters in a typical control algorithm, and for some robot tasks we may not even know a good solution without extensive experimentation. Even when we, as humans, have good intuitions about how to perform a given task, it is often difficult to translate these into the sensor and actuator spaces of the robot. Having the robot learn how to perform a given task is one way of addressing these problems. Specifying what the robot should do, and allowing it to fill in the details of how through learning, is an appealing idea. In general, describing a task at a higher, more behavioral level is easier for humans than specifying the exact mapping from sensors to actuators that defines a control policy. Reinforcement learning, in particular, is a very promising paradigm for learning on real robots. However, simply applying existing reinforcement learning techniques will almost certainly lead to failure. Issues such as large, continuous state and action spaces, extremely limited amounts of training data, lack of initial knowledge about the task and environment, and the necessity of keeping the robot physically safe during learning must be explicitly addressed if learning is to succeed.

In this dissertation, we identify some of the problems that must be overcome when implementing a reinforcement learning system on a real mobile robot. We discuss solutions to these problems and present two components that, together, allow us to use reinforcement learning techniques effectively on a real robot. HEDGER is a safe value-function approximation algorithm designed for continuous state and action spaces and sparse reward functions. JAQL is our general framework for reinforcement learning on real robots, and deals with the problems of initial knowledge and robot safety. We validate the effectiveness of both components on a variety of simulated and real robot task domains.
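To make the setting concrete, the sketch below shows the kind of Q-learning backup that value-function approximation over continuous states involves: each Q-value is estimated from nearby stored experience, with a fallback prediction when a query lies far from the training data. This is only a minimal Python illustration under those assumptions, not the HEDGER or JAQL algorithms themselves; the class name, function names, and parameters here are hypothetical.

```python
# Minimal sketch: Q-learning with a toy distance-weighted value approximator
# over continuous state-action vectors. Not the dissertation's actual code;
# all names and constants are illustrative assumptions.
import numpy as np


class WeightedAverageQFunction:
    """Estimates Q(s, a) by distance-weighted averaging of stored experience."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth
        self.xs = []  # stored (state, action) vectors
        self.qs = []  # stored Q-value estimates

    def predict(self, x, default=0.0):
        # With no nearby data, fall back to a safe default prediction.
        if not self.xs:
            return default
        xs = np.array(self.xs)
        qs = np.array(self.qs)
        d2 = np.sum((xs - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * self.bandwidth ** 2))
        if w.sum() < 1e-8:
            return default
        return float(np.dot(w, qs) / w.sum())

    def add(self, x, q):
        self.xs.append(np.asarray(x, dtype=float))
        self.qs.append(float(q))


def q_update(qfun, state, action, reward, next_state, candidate_actions,
             gamma=0.99, alpha=0.2):
    """One backup: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    x = np.concatenate([state, action])
    best_next = max(qfun.predict(np.concatenate([next_state, a]))
                    for a in candidate_actions)
    target = reward + gamma * best_next
    old_q = qfun.predict(x)
    qfun.add(x, old_q + alpha * (target - old_q))
```

As the abstract suggests, the important additions over such a naive approximator are caution and prior knowledge: trusting predictions only near previously seen data, and using initial knowledge about the task to gather useful experience while keeping the robot safe.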
