Equity-Directed Bootstrapping: Examples and Analysis

When faced with severely imbalanced binary classification problems, we often train models on bootstrapped data in which the instances of each class occur in a more favorable ratio, e.g., one-to-one. We view algorithmic inequity through the lens of imbalanced classification: to balance the performance of a classifier across groups, we can bootstrap training sets that are balanced with respect to both labels and group identity. For an example problem with severe class imbalance (prediction of suicide death from administrative patient records), we illustrate how an equity-directed bootstrap can bring test-set sensitivities and specificities much closer to satisfying the equal odds criterion. In the context of naïve Bayes and logistic regression, we analyze the equity-directed bootstrap, demonstrating that it works by bringing odds ratios close to one, and linking it to methods involving intercept adjustment, thresholding, and weighting.
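The resampling idea described above can be sketched in a few lines: draw, with replacement, an equal number of rows from every (label, group) cell of the training data. This is a minimal illustration of the general idea, not the authors' implementation; the function name `equity_directed_bootstrap` and its parameters are hypothetical.

```python
import numpy as np

def equity_directed_bootstrap(X, y, g, n_per_cell, seed=None):
    """Resample rows with replacement so that every (label, group)
    cell contributes exactly n_per_cell rows to the training set.

    X : (n, d) feature matrix
    y : (n,) binary labels
    g : (n,) group identifiers
    """
    rng = np.random.default_rng(seed)
    idx = []
    for label in np.unique(y):
        for group in np.unique(g):
            cell = np.flatnonzero((y == label) & (g == group))
            if cell.size == 0:
                raise ValueError(f"empty cell: label={label}, group={group}")
            # Oversample rare cells, undersample common ones, to a fixed size.
            idx.append(rng.choice(cell, size=n_per_cell, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx], g[idx]
```

After resampling, each label occurs equally often overall and within each group, so a model trained on the bootstrapped set no longer sees the rare-event imbalance that skews sensitivities and specificities across groups.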
