Auditing Black-box Models by Obscuring Features

Data-trained predictive models are widely used to assist in decision making, but they are often used as black boxes that output only a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular of how different attributes influence the model's predictions. Such understanding matters when trying to interpret the behavior of complex models, or to ensure that certain problematic attributes (like race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models: we study the extent to which existing models take advantage of particular features in a dataset without knowing how the models work. We show how a class of techniques originally developed for the detection and repair of disparate impact in classification models can be used to study the sensitivity of any model with respect to any subset of features. Our approach does not require the black-box model to be retrained. This is important if, for example, the model is only accessible via an API, and it distinguishes our work from other methods that investigate feature influence, such as feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models, and we validate our procedure using techniques from interpretable learning and feature selection.
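As a rough illustration of the auditing idea described above, the sketch below re-queries a black-box classifier after obscuring one feature and reports the resulting drop in accuracy. The names used here (audit_feature, predict_fn, n_repeats) are hypothetical, and the obscuring step is a simple random permutation of the audited column, which captures only direct influence; the full procedure in the paper additionally "repairs" correlated features so that indirect influence is removed as well.

```python
import numpy as np

def audit_feature(predict_fn, X, y, feature_idx, n_repeats=10, rng=None):
    """Estimate a black-box model's sensitivity to one feature by obscuring it.

    `predict_fn` is any callable returning class labels for a feature matrix;
    the model is never retrained, only re-queried.  Obscuring is done here by
    randomly permuting the audited column, which removes its direct signal
    while preserving its marginal distribution.  (This is a simplified proxy
    for the repair-based obscuring used in the paper.)
    """
    rng = np.random.default_rng(rng)
    baseline_acc = np.mean(predict_fn(X) == y)

    drops = []
    for _ in range(n_repeats):
        X_obscured = X.copy()
        X_obscured[:, feature_idx] = rng.permutation(X_obscured[:, feature_idx])
        obscured_acc = np.mean(predict_fn(X_obscured) == y)
        drops.append(baseline_acc - obscured_acc)

    # A large average drop indicates the model relies heavily on this feature.
    return float(np.mean(drops))
```

Calling audit_feature once per column and sorting the resulting drops gives a simple ranking of direct feature influence, which can then be compared against feature-selection or interpretable-model baselines of the kind used for validation in the paper.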
