SDTA: An Algebra for Statistical Data Transformation

Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) and etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformations frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance that maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how functional-equivalent transformation programs could be converted to other functionally equivalent programs, in the same or different language, permitting code reuse and result reproducibility, We also illustrate the possibility of using of SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.

[1]  E. F. Haghish Seamless interactive language interfacing between R and Stata , 2019, The Stata Journal: Promoting communications on statistics and Stata.

[2]  Hadley Wickham,et al.  Reshaping Data with the reshape Package , 2007 .

[3]  Jim Melton,et al.  Advanced SQL:1999: Understanding Object-Relational and Other Advanced Features , 2002 .

[4]  Joseph M. Hellerstein,et al.  Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[5]  Gultekin Özsoyoglu,et al.  Extending relational algebra and relational calculus with set-valued attributes and aggregate functions , 1987, TODS.

[6]  Steven J. DeRose,et al.  XML Path Language (XPath) , 1999 .

[7]  Limsoon Wong,et al.  A query language for multidimensional arrays: design, implementation, and optimization techniques , 1996, SIGMOD '96.

[8]  Arie Shoshani,et al.  Statistical Databases: Characteristics, Problems, and some Solutions , 1982, VLDB.

[9]  Eugene J. Shekita,et al.  Fundamental techniques for order optimization , 1996, SIGMOD '96.

[10]  Martin L. Kersten,et al.  MIL primitives for querying a fragmented world , 1999, The VLDB Journal.

[11]  Abraham Silberschatz,et al.  Extended algebra and calculus for nested relational databases , 1988, TODS.

[12]  Peter Baumann,et al.  A Comparative Analysis of Array Models for Databases , 2011, FGIT-DTA/BSBT.

[13]  Georg Gottlob,et al.  Closed World Databases Opened Through Null Values , 1988, VLDB.

[14]  Edward L. Robertson,et al.  Relational languages for metadata integration , 2005, TODS.

[15]  Jeffrey Heer,et al.  Wrangler: interactive visual specification of data transformation scripts , 2011, CHI.

[16]  Alex van Ballegooij RAM: A Multidimensional Array DBMS , 2004, EDBT Workshops.

[17]  Roberto Cornacchia,et al.  Flexible and efficient IR using array databases , 2007, The VLDB Journal.

[18]  Michael Stonebraker,et al.  The Architecture of SciDB , 2011, SSDBM.

[19]  Hosagrahar V. Jagadish,et al.  Provenance metadata for statistical data: An introduction to Structured Data Transformation Language (SDTL) , 2020, IASSIST Quarterly.

[20]  Michael Dixon,et al.  Google Earth Engine: Planetary-scale geospatial analysis for everyone , 2017 .

[21]  Roger G. Johnson,et al.  Extending relational algebra to manipulate temporal data , 1988, Inf. Syst..

[22]  H. V. Jagadish,et al.  C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation , 2019, SIGMOD Conference.

[23]  Klaus R. Dittrich,et al.  Data Provenance: A Categorization of Existing Approaches , 2007, BTW.

[24]  Carlo Zaniolo Database relations with null values , 1982, PODS '82.

[25]  Peter Baumann,et al.  The multidimensional database system RasDaMan , 1998, SIGMOD '98.

[26]  Peter Baumann,et al.  Management of multidimensional discrete data , 1994, The VLDB Journal.

[27]  Dennis Shasha,et al.  AQuery: Query Language for Ordered Data, Optimization Techniques, and Experiments , 2003, VLDB.

[28]  F. E.,et al.  A Relational Model of Data Large Shared Data Banks , 2000 .

[29]  Christian S. Jensen,et al.  Bringing order to query optimization , 2002, SGMD.

[30]  Z. Meral Ozsoyoglu,et al.  A Query Language for Statistical Databases , 1985 .

[31]  Kenneth Salem,et al.  A Language for Manipulating Arrays , 1997, VLDB.

[32]  Arie Shoshani,et al.  OLAP and statistical databases: similarities and differences , 1997, PODS '97.

[33]  Laks V. S. Lakshmanan,et al.  SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems , 1996, VLDB.