Discovering and Validating AI Errors With Crowdsourced Failure Reports

AI systems can fail to learn important behaviors, leading to real-world issues like safety risks and biased outcomes. Discovering these systematic failures often requires significant developer attention, from hypothesizing potential edge cases to collecting evidence and validating patterns. To scale and streamline this process, we introduce crowdsourced failure reports: end-user descriptions of how or why a model failed. We show how developers can use these reports to detect AI errors. We also design and implement Deblinder, a visual analytics system for synthesizing failure reports that helps developers discover and validate systematic failures. In semi-structured interviews and think-aloud studies with 10 AI practitioners, we explore the affordances of the Deblinder system and the applicability of failure reports in real-world settings. Lastly, we show how collecting additional data for the failure groups identified by developers can improve model performance.
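To make the pipeline concrete, the sketch below shows one way a failure report could be represented and coarsely grouped. Everything here is an illustrative assumption: the `FailureReport` fields and the keyword-overlap grouping are invented for exposition, not Deblinder's actual schema or synthesis algorithm, which pairs report analysis with interactive visual review.

```python
from dataclasses import dataclass, field

@dataclass
class FailureReport:
    """One end-user description of a suspected model failure (hypothetical schema)."""
    report_id: int
    description: str                                  # free text: how or why the model failed
    example_ids: list = field(default_factory=list)   # data points the reporter flagged

def group_reports(reports, min_shared=2):
    """Greedily group reports whose descriptions share keywords.

    A stand-in for the synthesis step: a real system would use stronger
    text similarity and developer review rather than raw token overlap.
    """
    def tokens(report):
        return {w.lower().strip(".,") for w in report.description.split() if len(w) > 3}

    groups = []
    for report in reports:
        for group in groups:
            if len(tokens(report) & tokens(group[0])) >= min_shared:
                group.append(report)
                break
        else:
            groups.append([report])
    return groups

reports = [
    FailureReport(1, "Model misses stop signs at night", [101, 102]),
    FailureReport(2, "Stop signs not detected in night scenes", [103]),
    FailureReport(3, "Pedestrians in heavy rain are missed", [104]),
]

for group in group_reports(reports):
    print([r.report_id for r in group])   # -> [1, 2] then [3]
```

Once a group such as "stop signs at night" is validated, its `example_ids` point to the data slice worth augmenting, which is the abstract's final claim: targeted data collection for developer-identified failure groups can improve model performance.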
