Integrating databases and workflow systems

Fields such as high-energy physics, astronomy, environmental science, and biology have seen an explosion in the volume of information they produce, creating a critical need for automated systems that manage scientific applications and their data. Database technology is well suited to several aspects of workflow management, yet contemporary workflow systems are assembled from multiple, separately developed components and do not exploit the full power of database management systems (DBMSs) in handling large volumes of data. We advocate a holistic view of a workflow management system (WFMS) that covers not only workflow modeling but also planning, scheduling, data management, and cluster management. It is therefore worthwhile to explore how databases can be augmented to manage workflows in addition to data. We present a language for modeling workflows that is tightly integrated with SQL. Each scientific program in a workflow is associated with an active table or view. Data products are defined in relational format, and both program invocation and querying are done in SQL. This tight coupling between workflow management and data manipulation is an advantage for data-intensive scientific programs.
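The idea of tying a scientific program to an active view can be illustrated with a minimal sketch. This is not the workflow language the paper presents. It only approximates the idea using Python's standard `sqlite3` module, under the assumption that a program can be registered as a SQL function and exposed through an ordinary view. The names `calibrate`, `raw_observations`, and `calibrated` are hypothetical and chosen purely for illustration.

```python
import sqlite3


def calibrate(raw_flux: float, gain: float) -> float:
    """Hypothetical stand-in for a scientific program in the workflow."""
    return raw_flux * gain


def build_workflow_db() -> sqlite3.Connection:
    """Create an in-memory database in which a view is backed by a program."""
    conn = sqlite3.connect(":memory:")
    # Register the program so that SQL statements can invoke it by name.
    conn.create_function("calibrate", 2, calibrate)
    conn.executescript(
        """
        -- Input data product, defined in relational format.
        CREATE TABLE raw_observations (
            obs_id   INTEGER PRIMARY KEY,
            raw_flux REAL NOT NULL,
            gain     REAL NOT NULL
        );
        -- Derived data product: selecting from this view runs the program.
        CREATE VIEW calibrated AS
            SELECT obs_id, calibrate(raw_flux, gain) AS flux
            FROM raw_observations;
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_workflow_db()
    conn.executemany(
        "INSERT INTO raw_observations VALUES (?, ?, ?)",
        [(1, 10.0, 2.0), (2, 4.0, 0.5)],
    )
    # Program invocation and querying both happen through one SQL statement.
    for row in conn.execute("SELECT obs_id, flux FROM calibrated ORDER BY obs_id"):
        print(row)
```

In this sketch the view is computed on demand each time it is queried. A fuller system of the kind the abstract describes would also have to address planning, scheduling, and cluster management, none of which this example attempts.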
