On Testing Machine Learning Programs

Nowadays, we are witnessing the wide adoption of machine learning (ML) models in many safety-critical systems, thanks to recent breakthroughs in deep learning and reinforcement learning. Many people now interact with ML-based systems every day, e.g., the voice recognition systems used by virtual personal assistants such as Amazon Alexa or Google Home. As the field of ML continues to grow, we are likely to witness transformative advances in a wide range of areas, from finance and energy to health and transportation. Given the growing importance of ML-based systems in our daily lives, it is becoming critically important to ensure their reliability. Recently, software researchers have started adapting concepts from the software testing domain (e.g., code coverage, mutation testing, and property-based testing) to help ML engineers detect and correct faults in ML programs. This paper reviews existing testing practices for ML programs. First, we identify and explain the challenges that should be addressed when testing ML programs. Next, we report the solutions proposed in the literature for testing ML programs. Finally, we identify gaps in the literature related to the testing of ML programs and recommend future research directions for the scientific community. We hope that this comprehensive review of software testing practices will help ML engineers identify the right approaches to improve the reliability of their ML-based systems. We also hope that the research community will act on our proposed research directions to advance the state of the art of testing for ML programs.
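To make the property-based/metamorphic testing idea mentioned above concrete, here is a minimal sketch of a metamorphic test for an ML classifier. It is an illustration only, not a method prescribed by the paper: the dataset, the model, and the scaling factor are assumptions chosen for the example (scikit-learn's iris data and DecisionTreeClassifier). The metamorphic relation exercised is scaling invariance: a decision tree's predictions should not change when every feature is multiplied by the same positive constant, so a mismatch between the source and follow-up runs signals a fault without needing ground-truth labels as an oracle. The factor 2.0 is used because powers of two scale floating-point values exactly.

```python
# A minimal sketch of metamorphic testing for an ML classifier,
# assuming scikit-learn is available. The metamorphic relation:
# uniformly scaling all features by a positive constant must not
# change a decision tree's predictions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def predictions(X_train, y_train, X_test):
    # Fixed random_state so both runs break split ties identically.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    return model.predict(X_test)

X, y = load_iris(return_X_y=True)
X_train, y_train, X_test = X[:120], y[:120], X[120:]

# Source test case: the original feature space.
source = predictions(X_train, y_train, X_test)

# Follow-up test case: every feature scaled by the same constant
# (2.0 is exact in floating point, avoiding rounding noise).
followup = predictions(2.0 * X_train, y_train, 2.0 * X_test)

# The relation itself serves as the test oracle: any mismatch
# indicates a fault, even though correct labels are never consulted.
assert np.array_equal(source, followup), "scaling invariance violated"
print("Metamorphic relation holds on", len(X_test), "test inputs")
```

The design choice here mirrors the survey's framing of the oracle problem: because the "correct" output of an ML program is often unknown, the test checks a relation between two executions rather than an expected value.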
