Hidden Parameter Markov Decision Processes: An Emerging Paradigm for Modeling Families of Related Tasks

Introduction

The goal of transfer is to use knowledge obtained by solving one task to improve a robot's (or software agent's) performance in future tasks. In general, we do not expect this to work; for transfer to be feasible, there must be something in common between the source task(s) and target task(s). The question at the core of the transfer learning enterprise is therefore: what makes two tasks related? Or, more generally, how do you define a family of related tasks? Given a precise definition of how a particular family of tasks is related, we can formulate clear optimization methods for selecting source tasks, for determining what knowledge should be imported from the source task(s), and for deciding how that knowledge should be used in the target task(s).

This paper describes one model that has appeared in several different research scenarios in which an agent is faced with a family of tasks that have similar, but not identical, dynamics (or reward functions). For example, a human learning to play baseball may, over the course of their career, be exposed to several different bats, each with a slightly different weight and length. A human who has learned to play baseball well with one bat would be expected to be able to pick up any similar bat and use it. Similarly, when learning to drive a car, one may learn in more than one car, and then be expected to be able to drive any make and model of car (within reasonable variations) with little or no relearning. These examples are instances of exactly the kind of flexible, reliable, and sample-efficient behavior that we should be aiming to achieve in robotics applications.

One way to model such a family of tasks is to posit that they are generated by a small set of latent parameters (e.g., the length and weight of the bat, or parameters describing the physical properties of the car's steering system and clutch) that are fixed for each problem instance (e.g., for each bat or car) but are not directly observable by the agent. Defining a distribution over these latent parameters results in a family of related tasks, and transfer is feasible to the extent that the number of latent variables is small, that the task dynamics (or reward function) vary smoothly with them, and that they can either be ignored or identified using transition data from the task. This model has appeared
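One way to make this latent-parameter view concrete is the hidden parameter MDP (HiP-MDP) formulation. The following is a minimal sketch of that definition; the particular symbols used here (the parameter space $\Theta$, the prior $P_\Theta$, and the per-instance index $b$) are illustrative notation rather than a verbatim restatement of the original definition.

\[
\mathcal{M} \;=\; \langle S, A, \Theta, T, R, \gamma, P_\Theta \rangle,
\qquad
s_{t+1} \sim T(\,\cdot \mid s_t, a_t, \theta_b\,),
\qquad
\theta_b \sim P_\Theta,
\]

where $S$ and $A$ are the state and action spaces, $R$ is the reward function, $\gamma$ is the discount factor, and $\theta_b \in \Theta$ is a latent parameter vector (e.g., the bat's length and weight) drawn once per task instance $b$, held fixed for the duration of that instance, and never observed directly by the agent. Each fixed value of $\theta_b$ induces an ordinary MDP; the distribution $P_\Theta$ over $\Theta$ is what ties those instances together into a single family of related tasks.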
