On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems

We study the problem of troubleshooting machine learning systems that rely on analytical pipelines of distinct components. Understanding and fixing errors that arise in such integrative systems is difficult because failures can occur at multiple points in the execution workflow. Moreover, errors can propagate, become amplified, or be suppressed, which complicates blame assignment. We propose a human-in-the-loop methodology that leverages human intellect for troubleshooting system failures. The approach simulates potential component fixes through human computation tasks and measures the expected improvements in the holistic behavior of the system. The method provides guidance to designers about how they can best improve the system. We demonstrate the effectiveness of the approach on an automated image captioning system that has been pressed into real-world use.
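To make the idea of simulating component fixes concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a hypothetical two-stage captioning pipeline (`detect`, `caption`), a hypothetical `human_fix_tags` function that stands in for a human computation task by restoring the correct tags, and a toy exact-match `score`. All names and data are illustrative assumptions.

```python
from statistics import mean

def run_pipeline(stages, state, fixes=None):
    """Run stages in order; after stage i, apply fixes[i] if one is given."""
    fixes = fixes or {}
    for i, stage in enumerate(stages):
        state = stage(state)
        if i in fixes:
            # Simulated fix: a human-corrected output replaces the component output.
            state = fixes[i](state)
    return state

def expected_improvement(stages, examples, fixes, score):
    """Mean end-to-end score with the fixes applied minus the mean score without them."""
    baseline = mean(score(run_pipeline(stages, ex)) for ex in examples)
    fixed = mean(score(run_pipeline(stages, ex, fixes)) for ex in examples)
    return fixed - baseline

# Hypothetical toy components (assumptions for illustration only).
def detect(state):
    tags = list(state["truth"])
    if state["image"] == "img1":
        tags.append("frisbee")  # the toy detector hallucinates an object on img1
    return {**state, "tags": tags}

def caption(state):
    return {**state, "caption": " and ".join(state["tags"])}

def human_fix_tags(state):
    # Stand-in for a crowd task in which workers correct the detected tags.
    return {**state, "tags": list(state["truth"])}

def score(state):
    # Toy exact-match quality measure; a real system would use human ratings.
    return 1.0 if state["caption"] == " and ".join(state["truth"]) else 0.0

examples = [
    {"image": "img1", "truth": ["dog", "ball"]},
    {"image": "img2", "truth": ["cat"]},
]
stages = [detect, caption]

gain_detector = expected_improvement(stages, examples, {0: human_fix_tags}, score)
gain_identity = expected_improvement(stages, examples, {1: lambda s: s}, score)
print(gain_detector, gain_identity)
```

Comparing the gain obtained from fixing each component in turn indicates where improvement effort is likely to pay off most in end-to-end quality, which is the kind of guidance to designers described above.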
