An Inference-based Framework for Managing Data Provenance
Scientists can use data-intensive applications to study and understand the behavior of a complex system. In a data-intensive application, a scientific model uses raw data products, collected from various sources, to produce new data products. Based on the generated output, scientists make decisions that could potentially affect the system being studied. Therefore, it is important to be able to trace an output data product back to its source values if that output appears to have an unexpected value.
Data provenance helps scientists investigate the origin of an unexpected value. Provenance can also be used to validate a scientific model. Existing provenance-aware systems have their own sets of constructs for designing the workflow of a scientific model in order to extract workflow provenance, so using these systems requires extensive training for scientists. Preparing workflow provenance manually is not a feasible option either, since it is a time-consuming task. Moreover, existing systems document provenance records explicitly to build a fine-grained provenance trace, which is used for tracing back to source data. Since most scientific computations handle massive amounts of data, the storage overhead of maintaining provenance data becomes a major concern.
We address these challenges by introducing a framework that manages both workflow and fine-grained data provenance in a generic and cost-efficient way. The framework extracts the workflow provenance of a scientific model automatically, with reduced effort and time. It also infers fine-grained data provenance without explicit documentation of provenance records, and thereby reduces the storage consumed to maintain provenance data. We introduce a suite of inference-based methods addressing different execution environments to make the framework more generic. Moreover, the framework is self-adaptive, so that it can provide optimally accurate provenance at minimal storage cost. Our evaluation based on two use cases shows that the framework provides a generic, cost-efficient solution for scientists who want to manage data provenance for their data-intensive applications.