Detecting Discrimination Risk in Automated Decision-Making Systems with Balance Measures on Input Data

Bias in the data used to train decision-making systems is a pressing socio-technical issue that has emerged in recent years and still lacks a commonly accepted solution. Indeed, the "bias in-bias out" problem represents one of the most significant risks of discrimination, spanning technical fields as well as ethical and social perspectives. We contribute to current studies of the issue by proposing a data quality measurement approach combined with risk management, both defined in ISO/IEC standards. For this purpose, we investigate imbalance in a given dataset as a potential risk factor for discrimination in the classification outcome: specifically, we evaluate whether the risk of bias in a classification output can be identified by measuring the level of (im)balance in the input data. We select four balance measures (the Gini, Shannon, Simpson, and Imbalance Ratio indexes) and test their capability to identify discriminatory classification outputs by applying them to the protected attributes in the training set. The results of this analysis show that the proposed approach is suitable for this goal: the balance measures properly detect unfairness in the software output, although the choice of index has a relevant impact on the detection of discriminatory outcomes; further work is therefore required to test the reliability of the balance measures as risk indicators in more depth. We believe that our approach to assessing the risk of discrimination should encourage more conscious and appropriate actions, as well as help prevent the adverse effects caused by the "bias in-bias out" problem.
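
As a rough illustration of the approach, the sketch below computes the four balance indexes for a single categorical protected attribute (for example, a gender column in the training set). The `balance_measures` helper, the specific normalizations (each index rescaled so that 1.0 means perfectly balanced classes), and the definition of the Imbalance Ratio as minority over majority class size are assumptions made for illustration; the paper's exact formulations may differ.

```python
from collections import Counter
from math import log

def balance_measures(values):
    """Compute four (im)balance indexes for a categorical protected attribute.

    The normalizations below are common choices and are assumptions here, not
    necessarily the formulations used in the paper. Each index is scaled so
    that 1.0 indicates perfectly balanced classes, while values near 0 signal
    strong imbalance (a potential discrimination-risk factor).
    """
    counts = Counter(values)
    n = sum(counts.values())
    m = len(counts)  # number of distinct classes of the protected attribute
    if m < 2:
        raise ValueError("need at least two classes to measure balance")
    p = [c / n for c in counts.values()]  # class proportions

    gini = (1 - sum(pi ** 2 for pi in p)) * m / (m - 1)            # normalized Gini heterogeneity
    shannon = -sum(pi * log(pi) for pi in p) / log(m)               # normalized Shannon entropy
    simpson = (1 / sum(pi ** 2 for pi in p) - 1) / (m - 1)          # normalized inverse Simpson
    imbalance_ratio = min(counts.values()) / max(counts.values())   # minority / majority size

    return {"gini": gini, "shannon": shannon,
            "simpson": simpson, "imbalance_ratio": imbalance_ratio}


# Example: a protected attribute with a strong majority class.
sample = ["female"] * 900 + ["male"] * 100
print(balance_measures(sample))
# All four indexes fall well below 1.0, flagging the column as a candidate
# risk factor to inspect before training a classifier on this data.
```

In a risk-management setting, such index values would then be compared against a predefined threshold to decide whether the dataset requires mitigation before being used to train an automated decision-making system.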
