Data Science Through the Looking Glass

The recent success of machine learning (ML) has led to explosive growth in systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date and focusing on questions that can advance our understanding of the field and guide investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GitHub and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, fine-grained analysis of libraries and pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret, and we draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) what technologies practitioners should rely on.
