Identifying Reusable Early-Life Options

We introduce a method for identifying short-duration reusable motor behaviors, which we call early-life options, that allow robots to perform well even in the very early stages of their lives. This is important when agents need to operate in environments where the use of poor-performing policies (such as the random policies with which they are typically initialized) may be catastrophic. Our method augments the original action set of the agent with specially constructed behaviors that maximize performance over a possibly infinite family of related motor tasks. These are akin to primitive reflexes in infant mammals: agents born with our early-life options, even if acting randomly, are capable of producing rudimentary behaviors comparable to those acquired by agents that actively optimize a policy for hundreds of thousands of steps. We also introduce three metrics for identifying useful early-life options and show that they yield behaviors that maximize the option's expected return while minimizing the risk that executing the option will result in extremely poor performance. We evaluate our technique on three simulated robots tasked with learning to walk under different battery consumption constraints and show that even random policies over early-life options are sufficient for the agent to perform comparably to agents trained for hundreds of thousands of steps.
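To make the return/risk trade-off concrete, below is a minimal sketch of how candidate options could be scored and ranked from Monte Carlo return samples collected by executing each option across tasks drawn from the family. The three scoring functions (mean-variance, lower-tail/CVaR-style, and skewness-adjusted) are illustrative assumptions inspired by the portfolio-selection literature, not the paper's actual three metrics; all names and parameters here are hypothetical.

```python
import numpy as np

def mean_variance_score(returns, risk_weight=1.0):
    """Markowitz-style score: expected return penalized by return variance."""
    r = np.asarray(returns, dtype=float)
    return r.mean() - risk_weight * r.var()

def lower_tail_score(returns, alpha=0.1, risk_weight=0.5):
    """Blend of the overall mean and the mean of the worst alpha-fraction of
    outcomes (a CVaR-style guard against catastrophically bad executions)."""
    r = np.asarray(returns, dtype=float)
    cutoff = np.quantile(r, alpha)
    tail_mean = r[r <= cutoff].mean()
    return (1.0 - risk_weight) * r.mean() + risk_weight * tail_mean

def skewness_score(returns, skew_weight=0.1):
    """Mean return with a small bonus for right-skewed return distributions,
    i.e. options whose bad outcomes are rare rather than systematic."""
    r = np.asarray(returns, dtype=float)
    mu, sigma = r.mean(), r.std()
    skew = 0.0 if sigma == 0.0 else np.mean(((r - mu) / sigma) ** 3)
    return mu + skew_weight * skew

def rank_candidate_options(candidates, metric=lower_tail_score):
    """candidates: dict mapping option name -> array of sampled returns.
    Returns option names sorted best-first under the chosen metric."""
    return sorted(candidates,
                  key=lambda name: metric(np.asarray(candidates[name])),
                  reverse=True)

# Usage (hypothetical data): pick the option least likely to perform terribly.
# ranked = rank_candidate_options({"crawl": crawl_returns, "shuffle": shuffle_returns})
```

Under a tail-aware metric like the CVaR-style score, an option with a slightly lower mean return but no catastrophic executions is preferred over one with a higher mean but occasional disastrous outcomes, which matches the stated goal of safe early-life behavior.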
