Balancing Effectiveness and Flakiness of Non-Deterministic Machine Learning Tests

Testing Machine Learning (ML) projects is challenging due to the inherent non-determinism of various ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices in selecting assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false positive failures in tests, they often set the bounds too loose, potentially missing critical bugs. We present FASER, the first systematic approach for balancing the trade-off between the fault-detection effectiveness and flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem between these competing objectives by varying the assertion bound. FASER leverages (1) statistical methods to estimate the flakiness rate, and (2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 23 out of 87 studied tests have conservative bounds and proposes tighter assertion bounds that maximize the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers, each fixing one test, out of which 14 pull requests have already been accepted.
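The trade-off described above can be illustrated with a minimal sketch. This is not FASER's implementation: the function names, the normal-distribution fit, the "all runs fail" kill criterion, and the 1% flakiness budget are all assumptions made for illustration. It considers an assertion of the form `value >= bound`, estimates the flakiness rate from repeated runs of the unmodified code, counts how many mutants each candidate bound would kill, and keeps the bound that kills the most mutants within the flakiness budget.

```python
from statistics import NormalDist, mean, stdev


def flakiness_rate(samples, bound):
    """Estimate P(value < bound) for the unmodified code.

    Assumption: the observed values are roughly normally distributed.
    """
    dist = NormalDist(mean(samples), stdev(samples))
    return dist.cdf(bound)


def mutants_killed(mutant_samples, bound):
    """Count mutants whose every observed value fails `value >= bound`.

    Assumption: a mutant counts as killed only if all of its runs fail.
    """
    return sum(1 for runs in mutant_samples if max(runs) < bound)


def pick_bound(samples, mutant_samples, candidates, max_flaky=0.01):
    """Return (bound, killed, flaky) for the best candidate, or None.

    Keeps the candidate that kills the most mutants among those whose
    estimated flakiness rate does not exceed `max_flaky` (hypothetical budget).
    """
    best = None
    for bound in candidates:
        flaky = flakiness_rate(samples, bound)
        if flaky > max_flaky:
            continue  # too likely to fail on correct code
        killed = mutants_killed(mutant_samples, bound)
        if best is None or killed > best[1]:
            best = (bound, killed, flaky)
    return best


if __name__ == "__main__":
    # Made-up accuracy values from repeated runs of the correct code.
    original_runs = [0.90, 0.91, 0.92, 0.93, 0.94]
    # Made-up accuracy values from runs of four mutants.
    mutant_runs = [[0.30, 0.35], [0.60, 0.65], [0.80, 0.82], [0.91, 0.92]]
    print(pick_bound(original_runs, mutant_runs, [0.5, 0.7, 0.85, 0.9]))
```

In this toy data, a loose bound such as 0.5 passes reliably but lets most mutants survive, while 0.9 is rejected because the correct code would fall below it too often. The constrained search is only one way to combine the two objectives; a weighted objective over flakiness and mutation score would be another.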
