It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks

Risk assessment instrument (RAI) datasets, particularly ProPublica’s COMPAS dataset, are commonly used in algorithmic fairness papers because benchmarking practice favors comparing algorithms on datasets used in prior work. In many cases, the data serves as a benchmark for demonstrating good performance, without accounting for the complexities of criminal justice (CJ) processes. We show that pretrial RAI datasets can contain numerous measurement biases and errors and that, because of disparities in discretion and deployment, algorithmic fairness applied to RAI datasets is limited in the claims it can make about real-world outcomes. These problems make the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Furthermore, the conventional practice of simply replicating previous data experiments may implicitly inherit or entrench normative positions without explicitly interrogating value-laden assumptions. Without an understanding of how other disciplines have engaged in CJ research and of how RAIs operate upstream and downstream, algorithmic fairness practices are poorly positioned to contribute meaningfully to CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way.
