Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products

Although algorithmic auditing has emerged as a key strategy for exposing systematic biases embedded in software platforms, the real-world impact of such audits remains poorly understood, as scholarship on whether they actually increase algorithmic fairness and transparency in commercial systems is nascent. To analyze the impact of publicly naming and disclosing performance results of biased AI systems, we investigate the commercial impact of Gender Shades, the first algorithmic audit of gender and skin-type performance disparities in commercial facial analysis models. This paper 1) outlines the audit design and structured disclosure procedure used in the Gender Shades study, 2) presents new performance metrics from targeted companies IBM, Microsoft, and Megvii (Face++) on the Pilot Parliaments Benchmark (PPB) as of August 2018, 3) provides performance results on PPB from non-target companies Amazon and Kairos, and 4) explores differences in company responses, as shared through corporate communications, that contextualize differences in performance on PPB. Within seven months of the original audit, we find that all three targets released new API versions. All targets reduced accuracy disparities between male and female and between darker- and lighter-skinned subgroups, with the largest improvement occurring for the darker-skinned female subgroup, which saw a 17.7% to 30.4% reduction in error between audit periods. Minimizing these disparities led to a 5.72% to 8.3% reduction in overall error on PPB for the target corporations' APIs. The overall performance of non-targets Amazon and Kairos lags significantly behind that of the targets, with overall error rates of 8.66% and 6.60%, and error rates of 31.37% and 22.50% on the darker-skinned female subgroup, respectively.
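The disparity figures reported above reduce to comparisons of per-subgroup error rates, where each subgroup is an intersection of gender and skin-type labels. The sketch below illustrates that computation in minimal form; the record layout, column names, and sample data are hypothetical placeholders, not the study's actual PPB annotations or API output format.

```python
# Minimal sketch of the intersectional subgroup error-rate comparison
# underlying the reported disparities. Records and field names are
# hypothetical illustrations, not the actual PPB data format.
from collections import defaultdict

# Each record: (predicted_gender, true_gender, skin_type_group, gender_group)
records = [
    ("female", "female", "darker",  "female"),
    ("male",   "female", "darker",  "female"),  # a misclassification
    ("male",   "male",   "lighter", "male"),
    # ... one record per benchmark image, per audited API
]

totals, errors = defaultdict(int), defaultdict(int)
for pred, true, skin, gender in records:
    subgroup = (skin, gender)  # intersectional subgroup, e.g. darker female
    totals[subgroup] += 1
    if pred != true:
        errors[subgroup] += 1

error_rate = {g: errors[g] / totals[g] for g in totals}

# The headline disparity: gap between worst and best subgroup error rates.
worst = max(error_rate, key=error_rate.get)
best = min(error_rate, key=error_rate.get)
print(f"largest subgroup gap: {error_rate[worst] - error_rate[best]:.1%} "
      f"({worst} vs. {best})")
```

Comparing the same subgroup error rates across two audit periods, as the paper does for the target APIs, is then a matter of differencing the per-subgroup rates computed before and after each vendor's API update.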

[1] Suresh Venkatasubramanian et al. Auditing black-box models for indirect influence, 2016, Knowledge and Information Systems.

[2] Karrie Karahalios et al. MapWatch: Detecting and Monitoring International Border Personalization on Online Maps, 2016, WWW.

[3] Nicholas Diakopoulos et al. Accountability in algorithmic decision making, 2016, Commun. ACM.

[4] Alice J. O'Toole et al. An other-race effect for face recognition algorithms, 2011, TAP.

[5] Timnit Gebru et al. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, 2018, FAT.

[6] Patrick J. Grother et al. Face Recognition Vendor Test (FRVT): Performance of Automated Gender Classification Algorithms, 2015.

[7] David Sontag et al. Why Is My Classifier Discriminatory?, 2018, NeurIPS.

[8] Michael Luca et al. Digital Discrimination: The Case of Airbnb.com, 2014.

[9] Anil K. Jain et al. Face Recognition Performance: Role of Demographic Information, 2012, IEEE Transactions on Information Forensics and Security.

[10] Franco Turini et al. A Survey of Methods for Explaining Black Box Models, 2018, ACM Comput. Surv.

[11] Karrie Karahalios et al. A path to understanding the effects of algorithm awareness, 2014, CHI Extended Abstracts.

[12] Christopher King et al. The CERT Guide to Coordinated Vulnerability Disclosure, 2017.

[13] Krishna P. Gummadi et al. Quantifying Search Bias: Investigating Sources of Bias for Political Searches in Social Media, 2017, CSCW.

[14] Chris Russell et al. Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR, 2017, ArXiv.

[15] K. Crenshaw. Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics, 1989.

[16] Jenna Burrell et al. How the machine 'thinks': Understanding opacity in machine learning algorithms, 2016.

[17] Christo Wilson et al. Peeking Beneath the Hood of Uber, 2015, Internet Measurement Conference.

[18] Vijay Erramilli et al. Detecting price and search discrimination on the internet, 2012, HotNets-XI.

[19] Roxana Geambasu et al. Discovering Unwarranted Associations in Data-Driven Applications with the FairTest Testing Toolkit, 2015, ArXiv.

[20] Karrie Karahalios et al. Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms, 2014.

[21] Hyeonjoon Moon et al. The FERET evaluation methodology for face-recognition algorithms, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[22] Karrie Karahalios et al. "Be Careful; Things Can Be Worse than They Appear": Understanding Biased Algorithms and Users' Behavior Around Them in Rating Platforms, 2017, ICWSM.

[23] David García et al. Bias in Online Freelance Marketplaces: Evidence from TaskRabbit and Fiverr, 2017, CSCW.