Assuring the Machine Learning Lifecycle

Machine learning has evolved into an enabling technology for a wide range of highly successful applications. The potential for this success to continue and accelerate has placed machine learning (ML) at the top of research, economic, and political agendas. Such unprecedented interest is fuelled by a vision of ML applicability extending to healthcare, transportation, defence, and other domains of great societal importance. Achieving this vision requires the use of ML in safety-critical applications that demand levels of assurance beyond those needed for current ML applications. Our article provides a comprehensive survey of the state of the art in the assurance of ML, i.e., in the generation of evidence that ML is sufficiently safe for its intended use. The survey covers the methods capable of providing such evidence at different stages of the ML lifecycle, i.e., the complex, iterative process that starts with the collection of the data used to train an ML component for a system and ends with the deployment of that component within the system. The article begins with a systematic presentation of the ML lifecycle and its stages. We then define assurance desiderata for each stage, review existing methods that contribute to achieving these desiderata, and identify open challenges that require further research.
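
To make the lifecycle described above concrete, the following minimal Python sketch walks through the stages in sequence: data collection, preprocessing, model training with hyperparameter selection, evaluation on held-out data as a source of assurance evidence, and deployment of the trained component. It is an illustrative sketch only; the dataset, hyperparameter grid, accuracy threshold, and file name are assumptions, not anything prescribed by the article.

```python
# Minimal, hypothetical sketch of the ML lifecycle stages discussed in the survey.
# The benchmark dataset, search grid, and 0.90 accuracy target are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Stage 1: data collection (here, a public benchmark dataset stands in for collected data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Stages 2-3: data preprocessing and model learning, with hyperparameter selection.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])
search = GridSearchCV(pipeline, {"model__n_estimators": [50, 100]}, cv=5)
search.fit(X_train, y_train)

# Stage 4: verification on held-out data; the threshold is an assumed,
# application-specific assurance target, not a value from the article.
accuracy = accuracy_score(y_test, search.predict(X_test))
assert accuracy >= 0.90, "insufficient evidence of adequate performance"

# Stage 5: deployment of the trained component (here, serialisation to disk).
joblib.dump(search.best_estimator_, "ml_component.joblib")
```

In a safety-critical setting, each of these stages would additionally produce assurance evidence (e.g., data coverage analyses, verification results, and monitoring plans) rather than the single accuracy check shown here.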
