A Multi-level Funneling Approach to Data Provenance Reconstruction

When data are retrieved from a file storage system or the Internet, is there information about their provenance (i.e., their origin or history)? It is possible that data could have been copied from another source and then transformed. Often, provenance is not readily available for data sets created in the past. Solving such a problem is the motivation behind the 2014 Provenance Reconstruction Challenge. This challenge is aimed at recovering lost provenance for two data sets: one data set (WikiNews articles) in which a list of possible sources has been provided, and another data set (files from GitHub repositories) in which the file sources are not provided. To address this challenge, we present a multi-level funneling approach to provenance reconstruction, a technique that incorporates text processing techniques from different disciplines to approximate the provenance of a given data set. We built three prototypes using this technique and evaluated them using precision and recall metrics. Our preliminary results indicate that our technique is capable of reconstructing some of the lost provenance.