Paper Information - Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias

Common methods for interpreting neural models in natural language processing typically examine either their structure or their behavior, but not both. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. It enables us to analyze the mechanisms by which information flows from input to output through various model components, known as mediators. We apply this methodology to analyze gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model's sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii) decomposable into effects flowing directly from the input and indirectly through the mediators.
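The decomposition into direct and indirect effects can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy functions `mediator` and `output`, their coefficients, and the use of plain differences as the effect measure. The sketch treats a single mediator (for example, one neuron's activation) and compares the model output under a baseline input and a changed input, holding either the input or the mediator at its baseline value.

```python
# Toy causal mediation analysis on a hand-written "model".
# All functions and coefficients here are hypothetical, chosen only to
# illustrate the total / direct / indirect decomposition.

def mediator(x: float) -> float:
    """Hypothetical mediator value (e.g. one neuron's activation) given input x."""
    return 2.0 * x + 1.0


def output(x: float, m: float) -> float:
    """Hypothetical model output: depends on x directly and through mediator m."""
    return 3.0 * x + 0.5 * m


def effects(x_base: float, x_new: float) -> dict:
    """Compute total, direct, and indirect effects of changing x_base to x_new."""
    m_base = mediator(x_base)   # mediator under the baseline input
    m_new = mediator(x_new)     # mediator under the changed input
    y_base = output(x_base, m_base)

    # Total effect: change the input and let the mediator respond.
    total = output(x_new, m_new) - y_base
    # Direct effect: change the input but hold the mediator at its baseline value.
    direct = output(x_new, m_base) - y_base
    # Indirect effect: keep the input at baseline but set the mediator
    # to the value it would take under the changed input.
    indirect = output(x_base, m_new) - y_base
    return {"total": total, "direct": direct, "indirect": indirect}


result = effects(0.0, 1.0)
print(result)
```

In this linear toy the direct and indirect effects sum exactly to the total effect. That is a property of the chosen functions; with interactions between the input and the mediator, the two need not add up to the total.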
