Planning with Closed-Loop Macro Actions

Doina Precup, University of Massachusetts, Amherst, MA, http://www.cs.umass.edu/~dprecup
Richard S. Sutton, University of Massachusetts, Amherst, MA, http://www.cs.umass.edu/~rich
Satinder Singh, University of Colorado, Boulder, CO, http://www.cs.colorado.edu/~baveja

Abstract

Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Conventional model-based reinforcement learning uses primitive actions that last one time step and that can be modeled independently of the learning agent. These can be generalized to macro actions: multi-step actions specified by an arbitrary policy and a way of completing. Macro actions generalize the classical notion of a macro operator in that they are closed-loop, uncertain, and of variable duration. Macro actions are needed to represent commonsense higher-level actions such as going to lunch, grasping an object, or traveling to a distant city. This paper generalizes prior work on temporally abstract models (Sutton 1995) and extends it from the prediction setting to include actions, control, and planning. We define a semantics of models of macro actions that guarantees the validity of planning using such models. The paper presents new results in the theory of planning with macro actions and illustrates its potential advantages in a gridworld task.

Introduction

The need for hierarchical and abstract planning is a fundamental problem in AI (see, e.g., Sacerdoti 1977; Laird et al. 1986; Korf 1983; Kaelbling 1993; Dayan & Hinton 1992). Model-based reinforcement learning offers a possible solution to the problem of integrating planning with real-time learning and decision-making (Peng & Williams 1993; Moore & Atkeson 1993; Sutton 1990; Sutton & Barto 1998). However, conventional model-based reinforcement learning uses one-step models that cannot represent commonsense higher-level actions. Modeling such actions requires the ability to handle different, interrelated levels of temporal abstraction.

Several researchers have proposed extending reinforcement learning to a higher level by treating entire closed-loop policies as actions, which we call macro actions (e.g., Mahadevan & Connell 1991; Singh 1992; Huber & Grupen 1997; Parr & Russell, personal communication; Dietterich, personal communication; McGovern, Sutton & Fagg 1997). Each macro action is specified by a closed-loop policy, which determines the primitive actions while the macro action is in force, and by a completion function, which determines when the macro action ends. When the macro action completes, a new primitive or macro action can be selected.

Macro actions are like AI's classical macro operators in that they can take control for some period of time, determining the actions during that time, and in that one can choose among macro actions much as one originally chose among primitive actions. However, classical macro operators are only a fixed sequence of actions, whereas macro actions incorporate a general closed-loop policy and a completion criterion. These generalizations are required when the environment is stochastic and uncertain, with general goals, as in reinforcement learning and Markov decision processes.

This paper extends an approach to planning with macro actions introduced by Sutton (1995), based on prior work by Singh (1992), Dayan (1993), and Sutton and Pinette (1985). That approach enables models of the environment at different temporal scales to be intermixed, producing temporally abstract models. Sutton (1995), Dayan (1993), and Sutton and Pinette (1985) were concerned only with predicting the environment, in effect modeling a single macro action.
Like Singh (1992), we model a whole set of macro actions and consider choices among them, but the extension summarized here includes more general macro actions and control of the environment. We develop a general theory of modeling and planning with macro actions. To illustrate the kind of advance we are trying to make, consider the example task depicted in Figure 1.
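The macro-action construct described above (a closed-loop policy plus a completion function, followed until it ends) can be made concrete with a short sketch. The Python below is illustrative only, not code from the paper: the MacroAction class, the env.step(state, action) interface, and the discount gamma are assumptions introduced here for exposition.

```python
import random

class MacroAction:
    """A closed-loop macro action: a policy plus a completion function."""
    def __init__(self, policy, completion):
        self.policy = policy          # policy(state) -> primitive action while the macro action is in force
        self.completion = completion  # completion(state) -> probability that the macro action ends here

def run_macro_action(env, state, macro, gamma=0.9):
    """Follow `macro` from `state` until its completion function fires.

    Returns the completion state, the discounted reward accumulated along the
    way, and the remaining discount gamma**k after the k primitive steps the
    macro action lasted -- the quantities a temporally abstract model of the
    macro action would be asked to predict.
    """
    total_reward, discount = 0.0, 1.0
    while True:
        action = macro.policy(state)             # closed loop: the action depends on the current state
        state, reward = env.step(state, action)  # assumed one-step environment interface
        total_reward += discount * reward
        discount *= gamma
        if random.random() < macro.completion(state):  # stochastic completion => variable duration
            return state, total_reward, discount
```

The abstract's claim that a suitable semantics of macro-action models guarantees the validity of planning can be read, roughly, as follows: if a model predicts each macro action's expected discounted reward and its discounted distribution over completion states, then backups over macro actions take the same form as ordinary one-step backups. The sketch below assumes such a model interface (reward(s) and next(s)); it is a hedged illustration of the idea, not the paper's algorithm.

```python
def plan_with_macro_models(states, macro_models, sweeps=50):
    """Value-iteration-style planning in which every 'action' is a macro action.

    Each model is assumed to expose reward(s), the expected discounted reward
    accumulated while the macro action runs from s, and next(s), a dict mapping
    completion states to probabilities already weighted by the discounting for
    the macro action's (variable) duration.
    """
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            V[s] = max(
                model.reward(s) + sum(w * V[s2] for s2, w in model.next(s).items())
                for model in macro_models
            )
    return V
```

Because each macro-action backup propagates value across the many primitive steps the macro action spans, such backups can spread value information over long distances in few sweeps, which is the kind of advantage the gridworld example is meant to illustrate.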

References

[1] Earl David Sacerdoti. A Structure for Plans and Behavior, 1977.

[2] Richard E. Korf. Learning to Solve Problems by Searching for Macro-Operators, 1983.

[3] Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models, 1987.

[4] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[5] Peter Dayan, Geoffrey E. Hinton. Feudal Reinforcement Learning, 1992, NIPS.

[6] Satinder P. Singh. Scaling Reinforcement Learning Algorithms by Learning Variable Temporal Resolution Models, 1992, ML.

[7] Sridhar Mahadevan, Jonathan Connell. Automatic Programming of Behavior-Based Robots Using Reinforcement Learning, 1991, Artif. Intell.

[8] Andrew W. Moore, Christopher G. Atkeson. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time, 1993, Machine Learning.

[9] Peter Dayan. Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993, Neural Computation.

[10] Jing Peng, Ronald J. Williams. Efficient Learning and Planning Within the Dyna Framework, 1993, Adapt. Behav.

[11] Leslie Pack Kaelbling. Hierarchical Learning in Stochastic Domains: Preliminary Results, 1993, ICML.

[12] Richard S. Sutton. TD Models: Modeling the World at a Mixture of Time Scales, 1995, ICML.

[13] Ben J. A. Kröse et al. Learning from Delayed Rewards, 1995, Robotics Auton. Syst.

[14] Manfred Huber, Roderic A. Grupen. Learning to Coordinate Controllers - Reinforcement Learning on a Control Basis, 1997, IJCAI.

[15] John E. Laird, Paul S. Rosenbloom, Allen Newell. Chunking in Soar: The Anatomy of a General Learning Mechanism, 1986, Machine Learning.

[16] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction, 1998.
