What Do Developers Ask About ML Libraries? A Large-scale Study Using Stack Overflow

Modern software systems increasingly include machine learning (ML) as an integral component. However, we do not yet understand the difficulties that software developers face when learning about ML libraries and using them within their systems. To that end, this work reports on a detailed manual examination of 3,243 highly rated Q&A posts related to ten ML libraries, namely TensorFlow, Keras, scikit-learn, Weka, Caffe, Theano, MLlib, Torch, Mahout, and H2O, on Stack Overflow, a popular online technical Q&A forum. We classify these questions into seven typical stages of an ML pipeline to understand the correlation between the library and the stage. We then study the questions and perform statistical analysis to address four research objectives: finding the most difficult stage, understanding the nature of the problems, understanding the nature of the libraries, and studying whether the difficulties stayed consistent over time. Our findings reveal an urgent need for software engineering (SE) research in this area. Both static and dynamic analyses are mostly absent and badly needed to help developers find errors earlier. While there has been some early research on debugging, much more work is needed. API misuses are prevalent, and API design improvements are sorely needed. Last and somewhat surprisingly, a tug of war between providing higher levels of abstraction and the need to understand the behavior of the trained model is prevalent.
