When Waiting is not an Option: Learning Options with a Deliberation Cost

Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. While the problem of "how" to learn options is increasingly well understood, the question of "what" good options should be has remained elusive. We formulate our answer to what "good" options should be in the bounded rationality framework (Simon, 1957) through the notion of deliberation cost. We then derive practical gradient-based learning algorithms to implement this objective. Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability.

[1]  H. Simon, et al.  Models of Man: Social and Rational, 1957.

[2]  H. Simon, et al.  Models of Man: Social and Rational. Mathematical Essays on Rational Human Behavior in a Social Setting, 1959.

[3]  Marvin Minsky, et al.  Steps toward Artificial Intelligence, 1995, Proceedings of the IRE.

[4]  Richard Fikes, et al.  Learning and Executing Generalized Robot Plans, 1993, Artif. Intell.

[5]  Benjamin Kuipers, et al.  Common-Sense Knowledge of Space: Learning from Experience, 1979, IJCAI.

[6]  R. Korf.  Learning to solve problems by searching for macro-operators, 1983.

[7]  Richard S. Sutton, et al.  Temporal credit assignment in reinforcement learning, 1984.

[8]  A. Neyman.  Bounded complexity justifies cooperation in the finitely repeated prisoners' dilemma, 1985.

[9]  Linn I. Sennott, et al.  Constrained Discounted Markov Decision Chains, 1991, Probability in the Engineering and Informational Sciences.

[10]  Gary L. Drescher, et al.  Made-up minds - a constructivist approach to artificial intelligence, 1991.

[11]  Geoffrey E. Hinton, et al.  Feudal Reinforcement Learning, 1992, NIPS.

[12]  Leslie Pack Kaelbling, et al.  Hierarchical Learning in Stochastic Domains: Preliminary Results, 1993, ICML.

[13]  Martin L. Puterman, et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[14]  Sebastian Thrun, et al.  Finding Structure in Reinforcement Learning, 1994, NIPS.

[15]  Stuart J. Russell, et al.  Reinforcement Learning with Hierarchies of Machines, 1997, NIPS.

[16]  Thomas G. Dietterich.  The MAXQ Method for Hierarchical Reinforcement Learning, 1998, ICML.

[17]  Doina Precup, et al.  Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[18]  E. Altman.  Constrained Markov Decision Processes, 1999.

[19]  Yishay Mansour, et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[20]  R. Selten, et al.  Bounded rationality: The adaptive toolbox, 2000.

[21]  Doina Precup, et al.  Temporal abstraction in reinforcement learning, 2000, ICML.

[22]  Glenn A. Iba, et al.  A heuristic approach to the discovery of macro-operators, 2004, Machine Learning.

[23]  Marek Petrik, et al.  Biasing Approximate Dynamic Programming with a Lower Discount Factor, 2008, NIPS.

[24]  M. Botvinick, et al.  Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective, 2009, Cognition.

[25]  Regina Barzilay, et al.  Learning High-Level Planning from Text, 2012, ACL.

[26]  Alec Solway, et al.  Optimal Behavioral Hierarchy, 2014, PLoS Comput. Biol.

[27]  Shie Mannor, et al.  Time-Regularized Interrupting Options (TRIO), 2014, ICML.

[28]  Honglak Lee, et al.  Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning, 2014, NIPS.

[29]  Nan Jiang, et al.  The Dependence of Effective Planning Horizon on Model Accuracy, 2015, AAMAS.

[30]  Shie Mannor, et al.  Approximate Value Iteration with Temporally Extended Actions, 2015, J. Artif. Intell. Res.

[31]  Shane Legg, et al.  Human-level control through deep reinforcement learning, 2015, Nature.

[32]  Marc G. Bellemare, et al.  The Arcade Learning Environment: An Evaluation Platform for General Agents, 2012, J. Artif. Intell. Res.

[33]  Alex Graves, et al.  Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[34]  Jan Peters, et al.  Probabilistic inference for determining options in reinforcement learning, 2016, Machine Learning.

[35]  Joshua B. Tenenbaum, et al.  Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, 2016, NIPS.

[36]  Shie Mannor, et al.  Adaptive Skills Adaptive Partitions (ASAP), 2016, NIPS.

[37]  Doina Precup, et al.  The Option-Critic Architecture, 2016, AAAI.

[38]  Dan Klein, et al.  Modular Multitask Reinforcement Learning with Policy Sketches, 2016, ICML.

[39]  Marlos C. Machado, et al.  A Laplacian Framework for Option Discovery in Reinforcement Learning, 2017, ICML.