A Data Restore Model for Reproducibility in Computational Statistics

Researchers are increasingly asked to publish their scientific data sets for the sake of transparency, re-use, and reproducibility. Particularly in economics and the social sciences, researchers often use sensitive statistical data that are subject to protection policies which inhibit distribution to third-party archives. In addition, a considerable number of data sets combine data from one or more external providers, which complicates curation-related activities. These circumstances motivate a data restore model based on fine-grained referencing that allows data provenance to be traced to the original archive in charge of curation. One goal is to enable data publication in difficult cases; another is to show how the gaps between data citation and code integration can be closed in order to eliminate all manual effort in arranging code and data files for reproduction attempts. On this basis we develop the requirements for a data restore model and elaborate a generic design with a view to an overall data management infrastructure. We further explore an experimental implementation, which we validate using the example of a real-world publication in economics. Finally, we close with the vision of a data and code ontology that carries statistical models from paper to a re-usable semantic level.
