Fairness implications of encoding protected categorical attributes

Protected attributes are often presented as categorical features that need to be encoded before being fed into a machine learning algorithm. How these attributes are encoded matters, because the encoding determines how the algorithm learns from the data: categorical feature encoding has a direct impact on both model performance and fairness. In this work, we compare the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding. We distinguish between two types of induced bias that can arise from these encodings and lead to unfair models. The first type, irreducible bias, stems from direct discrimination against a group category; the second type, reducible bias, stems from the large variance of estimates for statistically underrepresented groups. We take a deeper look into how regularization methods for target encoding can mitigate the induced bias when encoding categorical features. Furthermore, we tackle the problem of intersectional fairness, which arises when two protected categorical features are combined into one of higher cardinality. This practice is a powerful feature engineering technique for boosting model performance; we study its implications on fairness, as it can increase both types of induced bias.
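
A minimal sketch of the ideas above, using pandas on a made-up toy dataset (the column names, data values, and smoothing weight m are illustrative assumptions, not the paper's experimental setup). It contrasts one-hot encoding with plain and smoothed (regularized) target encoding, and shows how crossing two protected attributes inflates cardinality:

```python
import pandas as pd

# Toy data with two protected attributes and a binary target.
# Group "C" is deliberately small to illustrate reducible bias:
# its target-mean estimate has high variance.
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6 + ["C"] * 2,
    "sex":   ["F", "M"] * 7,
    "y":     [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1],
})

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df["group"], prefix="group")

# Plain target encoding: replace each category with its mean target.
group_means = df.groupby("group")["y"].mean()
df["te_plain"] = df["group"].map(group_means)

# Smoothed (regularized) target encoding: shrink each group mean
# toward the global prior; the weight m controls the shrinkage.
# Small groups like "C" are pulled hardest toward the prior, which
# damps the high-variance estimates behind reducible bias.
prior = df["y"].mean()
m = 5.0  # smoothing strength; an illustrative choice
counts = df.groupby("group")["y"].count()
smoothed = (counts * group_means + m * prior) / (counts + m)
df["te_smoothed"] = df["group"].map(smoothed)

print(df.drop_duplicates("group")[["group", "te_plain", "te_smoothed"]])

# Intersectionality: crossing two protected attributes multiplies
# cardinality and shrinks per-group counts, so the induced bias
# from either source can grow.
df["group_x_sex"] = df["group"] + "_" + df["sex"]
print(df["group_x_sex"].value_counts())
```

With m = 5, the small group "C" is shrunk from a raw mean of 1.0 toward the global prior of roughly 0.43, landing near 0.59, while the larger groups barely move. Crossing group with sex raises the cardinality from 3 to 6 categories and halves the typical per-group count, which is exactly the mechanism by which intersectional features can amplify both types of induced bias.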
