Ontology-Driven Provenance Management in eScience: An Application in Parasite Research

Provenance, from the French word "provenir", describes the lineage or history of a data entity. In scientific applications, provenance is critical for verifying experimental processes, validating data quality, and associating trust values with scientific results. Current industrial-scale eScience projects require an end-to-end provenance management infrastructure. This infrastructure must be underpinned by formal semantics so that software applications can analyze large-scale provenance information. Further, effective analysis of provenance information requires well-defined query mechanisms that support complex queries over large datasets. This paper introduces an ontology-driven provenance management infrastructure for biology experiment data, developed as part of the Semantic Problem Solving Environment (SPSE) for Trypanosoma cruzi (T.cruzi). This provenance infrastructure, called the T.cruzi Provenance Management System (PMS), is underpinned by (a) a domain-specific provenance ontology called the Parasite Experiment ontology, (b) specialized query operators for provenance analysis, and (c) a provenance query engine. The query engine uses a novel optimization technique based on materialized views, called materialized provenance views (MPV), to scale with increasing data size and query complexity. This comprehensive ontology-driven provenance infrastructure not only allows effective tracking and management of ongoing experiments in the Tarleton Research Group at the Center for Tropical and Emerging Global Diseases (CTEGD), but also enables researchers to retrieve the complete provenance information of scientific results for publication in the literature.
