A Critical Assessment of Benchmark Comparison in Planning

Recent trends in planning research have made empirical comparison commonplace. The field has started to settle into a methodology for such comparisons, which for obvious practical reasons requires running a subset of planners on a subset of problems. In this paper, we characterize the methodology and examine eight implicit assumptions about the problems, planners, and metrics used in many of these comparisons. The problem assumptions are: PR1) the performance of a general-purpose planner should not be penalized or biased when it is evaluated on a sample of problems and domains, PR2) minor syntactic differences in representation do not affect performance, and PR3) problems should be solvable by STRIPS-capable planners unless they require ADL. The planner assumptions are: PL1) the latest version of a planner is the best one to use, PL2) default parameter settings approximate good performance, and PL3) time cut-offs do not unduly bias outcomes. The metric assumptions are: M1) performance degrades similarly for each planner when run in a degraded runtime environment (e.g., on a slower machine platform), and M2) the number of plan steps distinguishes performance. We find that most of these assumptions are not supported empirically; in particular, planners are affected differently by them. We conclude with a call to the community to devote research resources to improving the state of the practice, especially by enhancing the available benchmark problems.
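
As a concrete illustration of the methodology the assumptions refer to, the following is a minimal sketch (in Python) of a benchmark comparison harness of the kind described: each planner is run on each problem under a fixed time cut-off (PL3), with default settings (PL2), and wall-clock time and success are recorded. The planner commands, problem files, and the 300-second cut-off are hypothetical placeholders, not details from the study.

    import subprocess
    import time

    # Hypothetical planner invocations and benchmark problems; a real study
    # would substitute actual executables (e.g., FF, STAN, BLACKBOX) and
    # PDDL domain/problem files.
    PLANNERS = {
        "planner-a": ["./planner-a", "--domain", "{domain}", "--problem", "{problem}"],
        "planner-b": ["./planner-b", "{domain}", "{problem}"],
    }
    PROBLEMS = [("blocks/domain.pddl", "blocks/prob01.pddl")]
    CUTOFF_SECONDS = 300  # PL3: reported outcomes can hinge on this choice

    def run_one(cmd_template, domain, problem):
        """Run a single planner/problem pair under a time cut-off."""
        cmd = [arg.format(domain=domain, problem=problem) for arg in cmd_template]
        start = time.monotonic()
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=CUTOFF_SECONDS)
            elapsed = time.monotonic() - start
            # M2: counting plan steps assumes the planner's output format
            # exposes them; parsing is planner-specific and omitted here.
            return {"solved": proc.returncode == 0, "seconds": elapsed}
        except subprocess.TimeoutExpired:
            # A timeout is scored as a failure at the cut-off.
            return {"solved": False, "seconds": CUTOFF_SECONDS}

    results = {
        (name, problem): run_one(template, domain, problem)
        for name, template in PLANNERS.items()
        for domain, problem in PROBLEMS
    }
    for key, outcome in sorted(results.items()):
        print(key, outcome)

Even this small sketch makes several of the assumptions visible: which problems appear in PROBLEMS (PR1), which planner binaries and versions are invoked (PL1), and how the cut-off and metrics are chosen (PL3, M2) all shape the resulting comparison.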
