Finding Options that Minimize Planning Time

We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options such that planning converges within a given maximum number of value-iteration passes. We first show that this problem is NP-hard, even when the task is constrained to be deterministic, the first such complexity result for option discovery. We then present the first polynomial-time boundedly suboptimal approximation algorithm for this setting, and empirically evaluate it against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains, including the classic four-rooms problem.
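
The quantity being bounded is the number of value-iteration sweeps needed for convergence when the backup set is augmented with a candidate set of options. The sketch below is illustrative only, not the paper's implementation: it assumes a tabular MDP given as primitive transition and reward arrays, and option models in the multi-time form of Sutton, Precup and Singh (an expected-discounted-reward vector and a discounted multi-step transition matrix per option), and simply counts sweeps until the value function stops changing.

```python
"""Illustrative sketch (assumptions noted above, not the paper's code):
count value-iteration sweeps to convergence for an MDP augmented with options."""

import numpy as np


def iterations_to_converge(P, R, option_models, gamma=0.95, tol=1e-6, max_sweeps=10_000):
    """Return the number of sweeps until ||V_{k+1} - V_k||_inf < tol.

    P: (A, S, S) array of primitive transition probabilities.
    R: (A, S)    array of primitive expected rewards.
    option_models: list of (R_o, P_o) pairs, where R_o is an (S,) vector of
        expected discounted reward for running option o to termination and
        P_o is an (S, S) discounted multi-step transition matrix.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for sweep in range(1, max_sweeps + 1):
        # Backups through primitive actions.
        q_prim = R + gamma * (P @ V)                      # shape (A, S)
        # Backups through options (discounting is folded into R_o, P_o).
        q_opts = [R_o + P_o @ V for (R_o, P_o) in option_models]
        V_new = np.max(np.vstack([q_prim] + [q[None, :] for q in q_opts]), axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return sweep
        V = V_new
    return max_sweeps
```

In these terms, the selection problem described in the abstract asks for the smallest set of options whose corresponding `option_models` keep this sweep count at or below a given bound.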
