Learning via human feedback in continuous state and action spaces

This paper considers the problem of extending Training an Agent Manually via Evaluative Reinforcement (TAMER) to continuous state and action spaces. The TAMER framework enables a non-technical human to train an agent through a natural form of feedback (positive or negative). Its advantages have been demonstrated on tasks where agents are trained by human feedback alone or by human feedback combined with environment rewards. However, these methods were originally designed for discrete state-action problems or for continuous-state, discrete-action problems. This paper proposes ACTAMER, an extension of TAMER that allows both continuous states and actions. The new framework can use any general function approximator to model the human trainer's feedback signal. We also investigate and evaluate combining ACTAMER with reinforcement learning, studying the combination of human feedback and reinforcement learning in both sequential and simultaneous settings. Our experimental results demonstrate that the proposed method successfully allows a human to train an agent in two continuous state-action domains: Mountain Car and Cart-pole (balancing).
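To make the core idea concrete, the minimal Python sketch below (our own illustration, not the paper's implementation) models the human feedback signal with an assumed random-Fourier-feature linear regressor and selects continuous actions by sampling candidates and maximizing the predicted feedback. The class and function names, the feature choice, and the candidate-sampling action selection are all hypothetical stand-ins for "any general function approximation of a human trainer's feedback signal."

```python
import numpy as np

# Minimal TAMER-style sketch for continuous state-action spaces.
# Hypothetical illustration: ACTAMER admits any general function approximator;
# here we assume random Fourier features with linear regression (SGD) as the
# learned model H_hat(s, a) of the human trainer's feedback.

class HumanRewardModel:
    def __init__(self, state_dim, action_dim, n_features=200, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        d = state_dim + action_dim
        # Random Fourier features approximating an RBF kernel (assumed choice).
        self.W = rng.normal(scale=1.0, size=(n_features, d))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.w = np.zeros(n_features)  # linear weights on the features
        self.lr = lr

    def _phi(self, state, action):
        x = np.concatenate([state, action])
        return np.cos(self.W @ x + self.b)

    def predict(self, state, action):
        """Predicted human reinforcement H_hat(s, a)."""
        return self.w @ self._phi(state, action)

    def update(self, state, action, human_feedback):
        """One SGD step toward the scalar feedback (e.g. +1 / -1) the trainer gave."""
        phi = self._phi(state, action)
        error = human_feedback - self.w @ phi
        self.w += self.lr * error * phi


def greedy_action(model, state, action_low, action_high, n_candidates=64, seed=None):
    """Approximate argmax_a H_hat(s, a) by scoring sampled continuous actions."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(action_low, action_high,
                             size=(n_candidates, len(action_low)))
    scores = [model.predict(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```

The candidate-sampling step is only one way to approximate the argmax over a continuous action set; in a combined setting, the predicted human reward could be used alongside (sequentially or simultaneously with) an environment reward, which is the combination the paper evaluates.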
