Estimating Example Difficulty using Variance of Gradients

In machine learning, a question of great interest is understanding which examples are challenging for a model to classify. Identifying atypical examples helps inform safe deployment of models, isolates examples that require further human inspection, and provides interpretability into model behavior. In this work, we propose Variance of Gradients (VOG) as a proxy metric for detecting outliers in the data distribution. We provide quantitative and qualitative support that VOG is a meaningful way to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. Data points with high VOG scores are more difficult for the model to classify and over-index on examples the model must memorize.
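The abstract leaves the scoring procedure implicit; the core idea is to collect the gradient of the true-class logit with respect to the input at several training checkpoints and measure how much those gradients vary per example. The sketch below is a minimal PyTorch rendering of that idea under stated assumptions, not the authors' implementation: the `checkpoints` list, the `vog_scores` name, and the averaging details (e.g., whether a square root is applied to the per-pixel variance, or scores are normalized per class) are illustrative choices.

```python
import torch

def vog_scores(checkpoints, x, y):
    """Sketch of a VOG-style difficulty score.

    checkpoints: list of models saved at different training stages (assumed).
    x: input batch of shape (N, C, H, W); y: true labels of shape (N,).
    Returns one scalar score per example.
    """
    grads = []
    for model in checkpoints:
        model.eval()
        x_req = x.clone().requires_grad_(True)
        logits = model(x_req)                              # (N, num_classes)
        # Gradient of the true-class logit w.r.t. the input pixels.
        # Summing is safe: examples are independent, so each example's
        # gradient depends only on its own logit.
        true_logits = logits.gather(1, y.unsqueeze(1)).sum()
        g, = torch.autograd.grad(true_logits, x_req)
        grads.append(g.detach())
    g = torch.stack(grads)                                 # (K, N, C, H, W)
    mu = g.mean(dim=0)
    var = ((g - mu) ** 2).mean(dim=0)                      # per-pixel variance across checkpoints
    return var.mean(dim=(1, 2, 3))                         # average over pixels -> (N,)
```

Examples can then be ranked by this score, with the highest-scoring subset surfaced for human-in-the-loop auditing as described above.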
