Feature-Based Explanations Don't Help People Detect Misclassifications of Online Toxicity

We present an experimental assessment of the impact of feature attribution-style explanations on human performance in predicting the consensus toxicity of social media posts with advice from an unreliable machine learning model. In doing so we add to a small but growing body of literature examining the utility of interpretable machine learning in terms of human outcomes. We also evaluate interpretable machine learning for the first time in the important domain of online toxicity, where fully automated methods have been criticized as inadequate measures of toxic behavior. We find that, contrary to expectations, explanations have no significant impact on accuracy or on agreement with model predictions, though they do somewhat change the distribution of subject errors while reducing the cognitive burden of the task for subjects. Our results contribute to the recognition of an intriguing expectation gap in interpretable machine learning between the general excitement the field has engendered and the ambiguous results of recent experimental work, including this study.
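
To make "feature attribution-style explanation" concrete, the sketch below is a minimal illustration and not the study's actual classifier or explanation method: it trains a toy linear toxicity model with scikit-learn (assumed version >= 1.0) on a few hypothetical posts and scores each token's contribution to the toxicity prediction as its TF-IDF value times the learned coefficient, the simplest form of feature attribution.

# Minimal sketch (hypothetical data and model, not the study's pipeline):
# per-token attribution for a linear toxicity classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training posts; labels: 1 = toxic, 0 = non-toxic (made-up examples).
posts = [
    "you are an idiot and everyone hates you",
    "what a stupid worthless take",
    "thanks for sharing, this was really helpful",
    "great point, I had not considered that",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)
clf = LogisticRegression().fit(X, labels)

def attribute(text):
    """Return (token, contribution) pairs for one post, largest magnitude first.

    For a linear model, each token's contribution to the toxicity logit is
    simply its TF-IDF value multiplied by the corresponding coefficient.
    """
    vec = vectorizer.transform([text])
    contributions = vec.multiply(clf.coef_[0]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    present = vec.nonzero()[1]  # indices of tokens that appear in this post
    return sorted(
        ((vocab[i], contributions[i]) for i in present),
        key=lambda pair: -abs(pair[1]),
    )

# Example: tokens pushing the prediction toward "toxic" get positive scores.
print(attribute("you are a stupid idiot"))

In an interface like the one the study evaluates, such per-token scores would typically be rendered as word highlighting alongside the model's advice rather than as raw numbers.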
