Managing scientific data: From data integration to scientific workflows*

Scientists are confronted with significant data-management problems due to the large volume and high complexity of scientific data. In particular, the latter makes data integration a difficult technical challenge. In this paper, we describe our work on semantic mediation and scientific workflows, and discuss how these technologies address integration challenges in scientific data management. We first give an overview of the main data-integration problems that arise from heterogeneity in the syntax, structure, and semantics of data. Starting from a traditional mediator approach, we show how semantic extensions can facilitate data integration in complex, multiple-worlds scenarios, where data sources cover different but related scientific domains. Such scenarios are not amenable to conventional schema-integration approaches. The core idea of semantic mediation is to augment database mediators and query-evaluation algorithms with appropriate knowledge-representation techniques to exploit information from shared ontologies. Semantic mediation relies on semantic data registration, which associates existing data with semantic information from an ontology. The Kepler scientific workflow system addresses the problem of synthesizing, from existing tools and applications, reusable workflow components and analytical pipelines to automate scientific analyses. After presenting core features and example workflows in Kepler, we present a framework for adding semantic information to scientific workflows. The resulting system is aware of semantically plausible connections between workflow components as well as between data sources and workflow components. This information can be used by the scientist during workflow design, and by the workflow engineer for creating data-transformation steps between semantically compatible but structurally incompatible analytical steps.
∗Work supported by NSF/ITR 0225673 (GEON), NSF/ITR 0225676 (SEEK), NIH/NCRR 1R24 RR019701-01 Biomedical Informatics Research Network (BIRN-CC), and DOE SciDAC DE-FC02-01ER25486 (SDM).

†San Diego Supercomputer Center, University of California, San Diego, {ludaesch,lin,bowers,efrat,baru}@sdsc.edu

‡Natural Resources of Canada, brodaric@nrcan.gc.ca
