Removing the influence of group variables in high‐dimensional predictive modelling

In many application areas, predictive models are used to support or make important decisions. There is increasing awareness that these models may encode spurious or otherwise undesirable correlations present in their training data. Such correlations may arise from a variety of sources, including batch effects, systematic measurement errors, and sampling bias. Without explicit adjustment, machine learning algorithms trained on such data can produce poor out-of-sample predictions that propagate the undesirable correlations. We propose a method to pre-process the training data, producing an adjusted dataset that is statistically independent of the nuisance (group) variable with minimum information loss. The approach is conceptually simple and creates the adjusted dataset in high-dimensional settings using a constrained form of matrix decomposition. The resulting dataset can then be used with any predictive algorithm, with the guarantee that predictions will be statistically independent of the group variable. We develop a scalable algorithm for implementing the method, along with theoretical support in the form of independence guarantees and optimality results. The method is illustrated on simulated examples and applied to two case studies: removing machine-specific correlations from brain scan data, and removing race and ethnicity information from a dataset used to predict recidivism. The motivation for removing undesirable correlations differs considerably between the two applications, which illustrates the broad applicability of our approach.
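To make the idea of a constrained matrix decomposition concrete, the following is a minimal sketch and not the authors' algorithm. It assumes one simple form of the constraint: the adjusted data must be a low-rank approximation of the centered data whose columns are orthogonal to the centered group variable. Under that assumption, the best approximation in Frobenius norm is the truncated SVD of the data after regressing out the group variable. The function name `remove_group_influence` and its parameters are hypothetical. Orthogonality here yields only zero linear correlation with the group variable, which coincides with statistical independence only under further assumptions such as joint Gaussianity.

```python
import numpy as np


def remove_group_influence(X, Z, rank):
    """Low-rank approximation of centered X with columns orthogonal to centered Z.

    Hypothetical helper for illustration only (an assumed form of the
    constraint, not the paper's implementation).
    X: (n, p) data matrix. Z: (n,) or (n, q) group variable(s). rank: target rank.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]

    # Center both, so that orthogonality to Z means zero sample covariance.
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)

    # Regress every column of X on Z and keep the residuals.
    # lstsq also copes with rank-deficient Z (e.g. a full set of dummies).
    coef, *_ = np.linalg.lstsq(Zc, Xc, rcond=None)
    residual = Xc - Zc @ coef

    # The truncated SVD of the residual is the closest rank-`rank` matrix to Xc
    # among all matrices whose columns are orthogonal to Zc.
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


# Usage on synthetic data: a binary group variable shifts every feature.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 30)) + 2.0 * z[:, None]
X_adj = remove_group_influence(X, z, rank=5)
```

Any downstream model can then be trained on `X_adj` in place of `X`. For new observations, the same centering and regression coefficients would have to be reused, which this sketch leaves out.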
