Machine Learning Testing: Survey, Landscapes and Horizons

This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving and machine translation). The paper also analyses datasets, research trends, and research focus, and concludes with research challenges and promising research directions in ML testing.
