Semantically rich metadata is foreseen to be pervasive in tomorrow’s cyber world. People are increasingly willing to store metadata in the hope that such extra information will enable a wide range of novel business intelligence applications. Provenance is metadata that describes the derivation history of data. It is considered to have great potential for helping with the reasoning about, analysis, validation, monitoring, integration, and reuse of data. In this paper, we introduce Butterfly, a provenance management system that offers the modeling, storage, and querying of provenance.

1 Motivation

With today’s abundant computer storage and powerful processing capability, people are becoming more and more aggressive in collecting extra data: data intentionally generated to assist the understanding of other data or processes. Simple forms of modeling and querying such data cannot satisfy non-expert users’ growing appetite for intelligent support in applications. For example,

• an online catalog vendor wants to track customers’ interaction with the UI in order to discover the sequential patterns of operations that end in the purchase of a product; find the most visited (used) web page (interface) to improve user experience; and query the connection between two visited web pages to better understand user behavior and enhance cross-selling.

• a scientist wants to log the detailed running steps and intermediate results of an experiment, preserving the opportunity for future inspection or reproduction of the result; providing proof to peer scientists of the authenticity of the experiment; and contributing to the pool of experimental recipes for reuse.

• a food manufacturer wants to record the production and distribution process of a product so that, whenever the product is found to be flawed, it is possible to trace back to the origin of the problem; map the affected vendors, shops, and regions; estimate the ensuing loss and compensation; and submit a report to the supervisory authority for a conformity check.

There are some common patterns in the aforementioned scenarios: