Metadata and provenance management

Scientists today collect, analyze, and generate terabytes and petabytes of data. These data are often shared among collaborators, who process and analyze them further. To facilitate sharing and interpretation, data need to carry metadata describing how they were collected or generated, along with provenance information describing how they were processed. This chapter describes metadata and provenance in the context of the data lifecycle. It also gives an overview of approaches to metadata and provenance management, followed by examples of how applications use metadata and provenance in their scientific processes.
