Detecting flaky tests in probabilistic and machine learning applications

Probabilistic programming systems and machine learning frameworks such as Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. Many of these primitives are non-deterministic, which makes it challenging for developers to write tests for applications that depend on such frameworks and often results in flaky tests: tests that fail non-deterministically when run on the same version of the code. In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports/commits that deal with flaky tests, and we categorize their common causes and fixes. This study provides developers with useful insights on dealing with flaky tests in this domain. Motivated by our study, we develop a technique, FLASH, to systematically detect tests whose assertions pass in some runs and fail in others on the same code. Such assertions fail because the sequence of random numbers drawn differs across runs of the same test. FLASH exposes these failures, and our evaluation on 20 projects detects 11 previously unknown flaky tests that we reported to developers.
