The Story in the Notebook: Exploratory Data Science using a Literate Programming Tool

Literate programming tools are used by millions of programmers today and are intended to facilitate presenting data analyses in the form of a narrative. We interviewed 21 data scientists to study coding behaviors in a literate programming environment and how data scientists kept track of the variants they explored. For participants who tried to keep a detailed history of their experimentation, both informal and formal versioning attempts led to problems, such as reduced notebook readability. During iteration, participants actively curated their notebooks into narratives, although primarily through cell structure rather than markdown explanations. Next, we surveyed 45 data scientists and asked them to envision how they might use their past history in a future version control system. Based on these results, we give design guidance for future literate programming tools, such as providing history search based on how programmers recall their explorations, through contextual details including images and parameters.
