FAWOS: Fairness-Aware Oversampling Algorithm Based on Distributions of Sensitive Attributes

With the increased use of machine learning algorithms to make decisions that impact people’s lives, it is crucial to ensure that predictions do not prejudice subgroups of the population with respect to sensitive attributes such as race or gender. Discrimination occurs when the probability of a positive outcome differs between the privileged and unprivileged groups defined by the sensitive attributes. It has been shown that this bias can originate from imbalanced data, where one class contains far fewer instances than the others. It is therefore also important to characterize the nature of the imbalance, including the distribution of the minority classes. This paper presents FAWOS, a Fairness-Aware Oversampling algorithm that aims to attenuate unfair treatment by handling the imbalance of sensitive attributes. We categorize datapoints into different types according to their local neighbourhood with respect to the sensitive attributes, identifying which ones are harder for classifiers to learn. To balance the dataset, FAWOS oversamples the training data by creating new synthetic datapoints from the identified types. We test the impact of FAWOS on different classifiers and analyze which of them can better handle sensitive-attribute imbalance. Empirically, we observe that FAWOS can effectively improve the fairness of the classifiers without compromising classification performance. Source code is available at: https://github.com/teresalazar13/FAWOS
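For intuition, the sketch below illustrates the two ingredients described above: typing each training point by the composition of its local neighbourhood with respect to the sensitive attribute, and then generating synthetic points by SMOTE-style interpolation, sampling the harder-to-learn types more often. It is a minimal sketch, not the released FAWOS code (see the repository linked above): the function names, the k = 5 neighbourhood, the safe/borderline/rare/outlier thresholds, the type weights, and the statistical-parity balancing target are all assumptions made for illustration.

# Minimal, illustrative sketch of a fairness-aware, SMOTE-style oversampler.
# NOT the reference FAWOS implementation (see the repository linked above);
# the k = 5 neighbourhood, the typology thresholds, the sampling weights and
# the balancing target are assumptions made for illustration only.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def neighbourhood_types(X, group, k=5):
    """Label each point by how many of its k nearest neighbours share its
    sensitive-attribute group: safe, borderline, rare or outlier."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                        # column 0 is the point itself
    same = (group[idx[:, 1:]] == group[:, None]).sum(axis=1)
    types = np.empty(len(X), dtype=object)
    types[same >= 4] = "safe"
    types[(same == 2) | (same == 3)] = "borderline"
    types[same == 1] = "rare"
    types[same == 0] = "outlier"
    return types


def fawos_like_oversample(X, y, group, unprivileged, k=5, weights=None, seed=0):
    """Synthesize positive-class points for the unprivileged group by interpolating
    between a point and one of its nearest same-group positive neighbours."""
    rng = np.random.default_rng(seed)
    # Harder-to-learn types receive more synthetic points (weights are assumptions).
    weights = weights or {"safe": 0.1, "borderline": 0.4, "rare": 0.3, "outlier": 0.2}
    mask = (group == unprivileged) & (y == 1)        # unprivileged, positive outcome
    priv = group != unprivileged
    # Balancing target (assumption): match the privileged group's positive rate.
    target = y[priv].mean()
    n_new = int(np.ceil((target * (~priv).sum() - mask.sum()) / (1 - target)))
    if n_new <= 0:
        return X, y, group
    pool = np.where(mask)[0]
    probs = np.array([weights[t] for t in neighbourhood_types(X, group, k)[mask]])
    probs = probs / probs.sum()
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(pool))).fit(X[mask])
    synthetic = []
    for base in rng.choice(pool, size=n_new, p=probs):
        _, nbrs = nn.kneighbors(X[base].reshape(1, -1))
        mate = pool[rng.choice(nbrs[0][1:])]         # a same-group positive neighbour
        synthetic.append(X[base] + rng.random() * (X[mate] - X[base]))
    return (np.vstack([X, synthetic]),
            np.concatenate([y, np.ones(n_new, dtype=y.dtype)]),
            np.concatenate([group, np.full(n_new, unprivileged)]))

A call such as fawos_like_oversample(X_train, y_train, gender, unprivileged="female") (hypothetical variable names) would return an augmented training set on which any of the classifiers mentioned above can then be trained and evaluated for both fairness and classification performance.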
