An Empirical Evaluation of Mutation Operators for Deep Learning Systems

Deep Learning (DL) is increasingly adopted to solve complex tasks such as image recognition or autonomous driving. Companies are considering the inclusion of DL components in production systems, but one of their main concerns is how to assess the quality of such systems. Mutation testing is a technique to inject artificial faults into a system, under the assumption that the capability to expose (kill) such artificial faults also translates into the capability to expose real faults. Researchers have proposed approaches and tools (e.g., DeepMutation and MuNN) that make mutation testing applicable to deep learning systems. However, existing definitions of mutation killing, based on accuracy drop, do not take into account the stochastic nature of the training process (accuracy may drop even when re-training the un-mutated system). Moreover, the same mutation operator might be effective or might be trivial/impossible to kill, depending on its hyper-parameter configuration. We conducted an empirical evaluation of existing mutation operators, showing that mutation killing requires a stochastic definition, and we identified the subset of effective mutation operators together with their most effective configurations.
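
To make the idea of a stochastic killing definition concrete, the sketch below compares accuracy samples from repeated trainings of the original and mutated models using a statistical test plus an effect-size threshold, so that ordinary training variance is not mistaken for a kill. This is a minimal illustration under assumptions: the placeholder data, the choice of the Mann-Whitney U test, and the Cohen's d threshold are not the paper's actual statistical procedure.

```python
# Hypothetical sketch of a stochastic mutation-killing check.
# The accuracy lists and thresholds below are illustrative placeholders,
# not the paper's or any tool's actual data or procedure.
import numpy as np
from scipy.stats import mannwhitneyu  # non-parametric test over accuracy samples


def cohens_d(a, b):
    """Effect size between two accuracy samples, using a pooled standard deviation."""
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (np.mean(a) - np.mean(b)) / pooled_sd if pooled_sd > 0 else 0.0


def is_killed(original_accuracies, mutant_accuracies, alpha=0.05, min_effect=0.5):
    """Consider a mutant killed only if the accuracy drop is statistically
    significant AND the effect size is non-negligible, so that the natural
    variance of re-training the un-mutated model does not count as a kill."""
    _, p_value = mannwhitneyu(original_accuracies, mutant_accuracies,
                              alternative='greater')
    effect = cohens_d(original_accuracies, mutant_accuracies)
    return p_value < alpha and effect >= min_effect


# Example: accuracies from 10 re-trainings of the original model vs. 10 of a mutant.
original = [0.991, 0.989, 0.992, 0.990, 0.988, 0.991, 0.990, 0.989, 0.992, 0.990]
mutant   = [0.951, 0.948, 0.955, 0.950, 0.947, 0.953, 0.949, 0.952, 0.950, 0.948]
print(is_killed(original, mutant))  # True: the drop clearly exceeds training noise
```

A single-run accuracy comparison would declare a kill whenever the mutant happens to score lower; the sampled, test-based check above only does so when the drop is distinguishable from re-training variance.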

[1] Yann LeCun et al. The MNIST database of handwritten digits, 2005.

[2] C. Borland et al. Effect Size. SAGE Research Methods Foundations, 2019.

[3] Logan Engstrom et al. Black-box Adversarial Attacks with Limited Queries and Information. ICML, 2018.

[4] Junfeng Yang et al. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. SOSP, 2017.

[5] Shin Yoo et al. Guiding Deep Learning System Testing Using Surprise Adequacy. ICSE, 2019.

[6] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[7] Matthew Wicker et al. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. TACAS, 2018.

[8] Jacob Cohen. A power primer. Psychological Bulletin, 1992.

[9] Lei Ma et al. DeepMutation: Mutation Testing of Deep Learning Systems. ISSRE, 2018.

[10] Eric R. Ziegel. Generalized Linear Models. Technometrics, 2002.

[11] Paolo Tonella et al. Misbehaviour Prediction for Autonomous Driving Systems. ICSE, 2020.

[12] Sarfraz Khurshid et al. DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems. ASE, 2018.

[13] Anna Philippou et al. Tools and Algorithms for the Construction and Analysis of Systems. Lecture Notes in Computer Science, 2018.

[14] Suman Jana et al. DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars. ICSE, 2018.

[15] Baowen Xu et al. Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software, 2011.

[16] Gordon Fraser et al. Automatically testing self-driving cars with search-based procedural content generation. ISSTA, 2019.

[17] Gail E. Kaiser et al. Automatic system testing of programs without test oracles. ISSTA, 2009.

[18] Lei Ma et al. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. ASE, 2018.

[19] Yue Zhao et al. DLFuzz: Differential Fuzzing Testing of Deep Learning Systems. ESEC/FSE, 2018.

[20] Jun Wan et al. MuNN: Mutation Analysis of Neural Networks. QRS-C, 2018.