The ML test score: A rubric for ML production readiness and technical debt reduction

Creating reliable, production-level machine learning systems raises a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production readiness of an ML system and for reducing its technical debt. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems, to help quantify these issues and present an easy-to-follow roadmap for improving production readiness and paying down ML technical debt.
