Efficiently discovering critical workflows in scientific explorations

Existing workflow management systems assume that scientists have a well-specified workflow design before execution. In reality, many scientific discoveries result from a dynamic process in which scientists keep proposing new hypotheses and verifying them through repeated trials of various experiments before achieving a successful experimental result. Consequently, not all the experiments in a workflow execution necessarily contribute to the final result. In this paper, we investigate the problem of effectively reproducing the results of previous scientific workflow executions by discovering the critical experiments that led to success and the logical constraints on their execution order. We design a relational schema and SQL queries for effectively recording the workflow execution log, efficiently identifying the critical experiments from the log, and recommending experiment reproduction strategies to users. Furthermore, we propose optimization techniques for evaluating such SQL queries that exploit the unique characteristics of the log data. Experimental evaluations demonstrate the speedup achieved by our approach.
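
To make the idea of querying an execution log for critical experiments concrete, the following is a minimal sketch and not the schema or queries of the paper. It assumes a hypothetical two-table log (`experiment` and `depends_on`, both names invented here) and treats an experiment as critical if the successful final experiment depends on it directly or transitively. A recursive SQL query collects those experiments, and a second query lists the ordering constraints among them.

```python
import sqlite3

# Hypothetical log schema, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (
    exp_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    status TEXT NOT NULL              -- 'success' or 'failure'
);
CREATE TABLE depends_on (             -- exp_id consumed the output of input_id
    exp_id   INTEGER NOT NULL REFERENCES experiment(exp_id),
    input_id INTEGER NOT NULL REFERENCES experiment(exp_id)
);
""")

# A made-up exploration: experiment 2 failed, experiment 6 was a side track.
conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)", [
    (1, "prepare sample", "success"),
    (2, "assay A", "failure"),
    (3, "assay B", "success"),
    (4, "calibrate", "success"),
    (5, "final analysis", "success"),
    (6, "side experiment", "success"),
])
conn.executemany("INSERT INTO depends_on VALUES (?, ?)", [
    (2, 1), (3, 1), (5, 3), (5, 4), (6, 1),
])

# Recursive CTE: start at the final experiment and walk back over its inputs.
CRITICAL_CTE = """
WITH RECURSIVE critical(exp_id) AS (
    SELECT ?
    UNION
    SELECT d.input_id
    FROM depends_on d JOIN critical c ON d.exp_id = c.exp_id
)
"""

def critical_experiments(conn, final_id):
    """Return ids of experiments the final result transitively depends on."""
    rows = conn.execute(
        CRITICAL_CTE + "SELECT exp_id FROM critical ORDER BY exp_id",
        (final_id,),
    )
    return [r[0] for r in rows]

def ordering_constraints(conn, final_id):
    """Return (before, after) pairs that a reproduction must respect."""
    rows = conn.execute(
        CRITICAL_CTE + """
        SELECT d.input_id, d.exp_id
        FROM depends_on d
        WHERE d.exp_id IN (SELECT exp_id FROM critical)
        ORDER BY d.input_id, d.exp_id
        """,
        (final_id,),
    )
    return list(rows)

print(critical_experiments(conn, 5))   # experiments needed to reproduce 5
print(ordering_constraints(conn, 5))   # partial order among them
```

In this toy log the failed assay and the side experiment are excluded, since the final analysis never consumed their output. Any execution order consistent with the returned pairs would be a valid reproduction strategy under these assumptions.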
