A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks

Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and other rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior and encourages poor coding practices, and that their results can be hard to reproduce. To understand the good and bad practices used in the development of real notebooks, we studied 1.4 million notebooks from GitHub. We present a detailed analysis of the notebook characteristics that impact reproducibility. We also propose a set of best practices that can improve the reproducibility rate and discuss open challenges that require further research and development.
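
As a concrete illustration of one characteristic that bears on reproducibility, the sketch below parses a notebook's JSON and flags unexecuted code cells, out-of-order execution counts, and skipped counts, all signals that the stored outputs may not be reproduced by a fresh top-to-bottom run. This is a minimal sketch under stated assumptions, not the study's actual analysis pipeline; the helper name `check_execution_order` is hypothetical.

```python
import json

def check_execution_order(path):
    """Flag execution-count patterns that hint at non-reproducibility.

    Hypothetical helper: inspects the `execution_count` field that the
    .ipynb format stores for each code cell (None if never executed).
    """
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)

    # Collect execution counts of code cells in document order
    # (assumes nbformat 4, where cells live in a top-level list).
    counts = [
        cell.get("execution_count")
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    executed = [c for c in counts if c is not None]

    return {
        # Some code cells were never run before the notebook was saved.
        "unexecuted_cells": len(executed) < len(counts),
        # Cells were run in a different order than they appear.
        "out_of_order": executed != sorted(executed),
        # Gaps between consecutive counts: cells were re-run or deleted.
        "skips": any(b - a > 1 for a, b in zip(executed, executed[1:])),
    }
```

A notebook reporting `out_of_order` or `skips`, for example, records an interactive session whose history differs from the document's top-to-bottom reading, so re-executing it linearly may yield results other than the ones saved in it.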
