Investigating Human + Machine Complementarity for Recidivism Predictions

When might human input help (or not) when assessing risk in fairness domains? Dressel and Farid (2018) asked Mechanical Turk workers to evaluate a subset of defendants in the ProPublica COMPAS data for risk of recidivism, and concluded that COMPAS predictions were no more accurate or fair than predictions made by humans. We delve deeper into this claim to explore differences in human and algorithmic decision making. We construct a Human Risk Score based on the predictions made by multiple Turk workers, characterize the features that determine agreement and disagreement between COMPAS and Human Scores, and construct hybrid Human+Machine models to predict recidivism. Our key finding is that on this data set, Human and COMPAS decision making differed, but not in ways that could be leveraged to significantly improve ground-truth prediction. We present the results of our analyses and suggestions for data-collection best practices that leverage the complementary strengths of humans and machines in the fairness domain.
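The abstract describes two constructions: an aggregated Human Risk Score from multiple Turk workers' judgments, and hybrid Human+Machine recidivism models. The sketch below is a minimal illustration of that general idea, not the authors' pipeline; the file names, column names (`defendant_id`, `worker_prediction`, `compas_decile_score`, `two_year_recid`), the vote-fraction aggregation, and the logistic-regression hybrid are all assumptions made for the example.

```python
# Rough sketch (assumed schema, not the paper's code):
# (1) aggregate per-worker recidivism judgments into a Human Risk Score,
# (2) fit a simple hybrid model that combines the COMPAS score with it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: one row per (defendant, worker) judgment, and one row
# per defendant with the COMPAS decile score and the two-year recidivism label.
turk = pd.read_csv("turk_judgments.csv")
compas = pd.read_csv("compas_subset.csv")

# (1) Human Risk Score: here, simply the fraction of workers who predicted
# the defendant would recidivate.
human_score = (
    turk.groupby("defendant_id")["worker_prediction"]
    .mean()
    .rename("human_risk_score")
    .reset_index()
)
df = compas.merge(human_score, on="defendant_id")

# (2) Hybrid Human+Machine model: logistic regression on both signals,
# evaluated against the ground-truth two-year recidivism label.
X = df[["compas_decile_score", "human_risk_score"]]
y = df["two_year_recid"]
hybrid = LogisticRegression(max_iter=1000)
print("Hybrid AUC:", cross_val_score(hybrid, X, y, cv=5, scoring="roc_auc").mean())

# Single-signal baselines, for checking whether the hybrid adds anything.
for col in ["compas_decile_score", "human_risk_score"]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), df[[col]], y,
                          cv=5, scoring="roc_auc").mean()
    print(f"{col} AUC:", auc)
```

Comparing the hybrid AUC against the single-signal baselines is one simple way to test the paper's question of whether human and COMPAS judgments are complementary.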

[2] Panagiotis G. Ipeirotis, et al. Quality management on Amazon Mechanical Turk, 2010, HCOMP '10.

[3] W. Grove, et al. Clinical versus mechanical prediction: a meta-analysis, 2000, Psychological Assessment.

[4] Rich Caruana, et al. Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation, 2017, AIES.

[5] Christopher T. Lowenkamp, et al. False Positives, False Negatives, and False Analyses: A Rejoinder to "Machine Bias: There's Software Used across the Country to Predict Future Criminals. and It's Biased against Blacks", 2016.

[6] Rich Caruana, et al. An empirical comparison of supervised learning algorithms, 2006, ICML.

[7] Dayong Wang, et al. Deep Learning for Identifying Metastatic Breast Cancer, 2016, arXiv.

[8] Margo I. Seltzer, et al. Learning Certifiably Optimal Rule Lists, 2017, KDD.

[9] Johannes Gehrke, et al. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission, 2015, KDD.

[10] Krishna P. Gummadi, et al. Human Perceptions of Fairness in Algorithmic Decision Making: A Case Study of Criminal Risk Prediction, 2018, WWW.

[11] P. Gendreau, et al. A Meta-Analysis of the Predictors of Adult Offender Recidivism: What Works!, 1996.

[12] Jure Leskovec, et al. Human Decisions and Machine Predictions, 2017, The Quarterly Journal of Economics.

[13] Richard A. Berk, et al. Statistical Procedures for Forecasting Criminal Behavior, 2013.

[14] Jon M. Kleinberg, et al. Inherent Trade-Offs in the Fair Determination of Risk Scores, 2016, ITCS.

[15] Richard A. Berk, et al. Overview of: "Statistical Procedures for Forecasting Criminal Behavior: A Comparative Assessment", 2013.

[16] Eric Horvitz, et al. Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure, 2018, HCOMP.

[17] Rich Caruana, et al. Auditing Black-Box Models Using Transparent Model Distillation With Side Information, 2017.

[18] Jure Leskovec, et al. The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables, 2017, KDD.

[20] Alexandra Chouldechova, et al. Fairer and more accurate, but for whom?, 2017, arXiv.

[21] Eric Horvitz, et al. Complementary computing: policies for transferring callers from dialog systems to human receptionists, 2006, User Modeling and User-Adapted Interaction.

[22] Eric Horvitz, et al. Discovering Blind Spots in Reinforcement Learning, 2018, AAMAS.

[23] Alexandra Chouldechova, et al. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, 2016, Big Data.

[25] Eric Horvitz, et al. Identifying Unknown Unknowns in the Open World: Representations and Policies for Guided Exploration, 2016, AAAI.

[26] D. A. Andrews, et al. The Recent Past and Near Future of Risk and/or Need Assessment, 2006.

[27] Hany Farid, et al. The accuracy, fairness, and limits of predicting recidivism, 2018, Science Advances.

[28] Weng-Keen Wong, et al. Principles of Explanatory Debugging to Personalize Interactive Machine Learning, 2015, IUI.

[29] Swami Sankaranarayanan, et al. Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms, 2018, Proceedings of the National Academy of Sciences.

[30] Cynthia Rudin, et al. Interpretable classification models for recidivism prediction, 2015, arXiv:1503.07810.

[31] Eric Horvitz, et al. Combining human and machine intelligence in large-scale crowdsourcing, 2012, AAMAS.

[32] Eun Yi Kim, et al. Improving Image Search Results Using Mean Shift Clustering, 2009.