Technology Readiness Levels for Machine Learning Systems

Modern tools make it easy to develop and deploy machine learning systems, but the process is typically rushed and treated as a means to an end. This lack of diligence can lead to technical debt, scope creep and misaligned objectives, model misuse and failures, and expensive consequences. Engineering systems, on the other hand, follow well-defined processes and testing standards to streamline development for high-quality, reliable results. The extreme is spacecraft systems, where mission-critical measures and robustness are ingrained in the development process. Drawing on experience in both spacecraft engineering and AI/ML (from research through product), we propose a proven systems engineering approach for machine learning development and deployment. Our Technology Readiness Levels for ML (TRL4ML) framework defines a principled process to ensure robust systems while being streamlined for ML research and product, including key distinctions from traditional software engineering. Moreover, TRL4ML defines a common language for people across the organization to work collaboratively on ML technologies.
