Computing Environments for Reproducibility: Capturing the "Whole Tale"

The act of sharing scientific knowledge is rapidly evolving away from traditional articles and presentations toward the delivery of executable objects that integrate the data and computational details (e.g., scripts and workflows) upon which the findings rely. This envisioned coupling of data and process is essential to advancing science but faces technical and institutional barriers. The Whole Tale project aims to address these barriers by connecting computational, data-intensive research efforts with the larger research process, transforming knowledge discovery and dissemination into a process in which data products are united with research articles to create "living publications" or "tales". Whole Tale serves the full spectrum of science, empowering both users in the long tail of science and power users who need access to big data and compute resources. We report here on the design, architecture, and implementation of the Whole Tale environment.
