Datasets are often derived by manipulating raw data with statistical software packages. Documenting a derived dataset requires recording both the raw input and the manipulations applied to it. Statistical packages typically provide limited help in documenting the provenance of the derived data they produce. At best, the operations performed by the package are described in a script, and because each package uses its own representation, these scripts are hard for users to understand. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system that captures the data transformations in scripts for statistical packages and represents them as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of the data manipulation performed in practice. We then implement SDTA, which is inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata's capture of data transformations from a pool of sample transformation scripts in at least two languages, SPSS and Stata (SAS and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata through a web-based interface, visualize the intermediate steps, and trace the provenance and changes of data at different levels for a better understanding of the process.
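To make the idea of capturing script transformations in a package-neutral form concrete, the following is a minimal sketch and not the C2Metadata implementation. It assumes a tiny subset of Stata-style commands; the operator names (`compute`, `rename`, `drop_variables`, `drop_cases`) and the record fields are hypothetical and are not the actual SDTL vocabulary. The helper `trace_variables` illustrates, under the same assumptions, how the variable list can be followed across intermediate steps.

```python
import json
import re


def parse_stata_line(line):
    """Translate one simplified Stata-style command into a neutral record.

    The record layout is an illustrative assumption, not the SDTL schema.
    """
    line = line.strip()
    m = re.fullmatch(r"(?:generate|gen)\s+(\w+)\s*=\s*(.+)", line)
    if m:
        return {"op": "compute", "target": m.group(1),
                "expression": m.group(2).strip(), "source": line}
    m = re.fullmatch(r"rename\s+(\w+)\s+(\w+)", line)
    if m:
        return {"op": "rename", "old": m.group(1),
                "new": m.group(2), "source": line}
    # "drop if <condition>" removes cases, so test it before the variable form.
    m = re.fullmatch(r"drop\s+if\s+(.+)", line)
    if m:
        return {"op": "drop_cases", "condition": m.group(1).strip(),
                "source": line}
    m = re.fullmatch(r"drop\s+(.+)", line)
    if m:
        return {"op": "drop_variables", "variables": m.group(1).split(),
                "source": line}
    return {"op": "unsupported", "source": line}


def trace_variables(initial, steps):
    """Return the list of variables that exist after each step."""
    history = []
    current = list(initial)
    for step in steps:
        if step["op"] == "compute" and step["target"] not in current:
            current.append(step["target"])
        elif step["op"] == "rename":
            current = [step["new"] if v == step["old"] else v for v in current]
        elif step["op"] == "drop_variables":
            current = [v for v in current if v not in step["variables"]]
        # drop_cases and unsupported steps leave the variable list unchanged.
        history.append(list(current))
    return history


if __name__ == "__main__":
    script = [
        "generate bmi = weight / (height^2)",
        "rename bmi body_mass",
        "drop weight height",
    ]
    steps = [parse_stata_line(s) for s in script]
    print(json.dumps(steps, indent=2))
    print(trace_variables(["id", "weight", "height"], steps))
```

Keeping the original command text in each record's `source` field is one way to let a reader move between the neutral description and the script line that produced it.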