Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias

Common methods for interpreting neural models in natural language processing typically examine either their structure or their behavior, but not both. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. It enables us to analyze the mechanisms by which information flows from input to output through various model components, known as mediators. We apply this methodology to analyze gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model's sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii) decomposable into effects flowing directly from the input and indirectly through the mediators.
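The decomposition in (iii) can be illustrated on a toy network. The sketch below is not the paper's implementation: the tiny model, its weights, the function names, and the use of a plain output difference as the effect measure are all assumptions chosen for illustration. It treats one hidden neuron as the mediator and compares the total effect of changing the input with the direct effect (mediator frozen at its original value) and the indirect effect (only the mediator changed).

```python
import math

# Hypothetical toy model (not from the paper): one scalar input,
# two hidden "neurons", and one scalar output.
W_IN = [1.5, -0.5]   # input -> hidden weights
W_OUT = [2.0, 1.0]   # hidden -> output weights
W_SKIP = 0.3         # input -> output path that bypasses the hidden units


def hidden(x):
    """Hidden activations for input x."""
    return [math.tanh(w * x) for w in W_IN]


def output(x, h):
    """Model output given the input x and (possibly patched) hidden activations h."""
    return W_SKIP * x + sum(v * hi for v, hi in zip(W_OUT, h))


def effects(x_base, x_alt, mediator):
    """Return (total, direct, indirect) effects of changing x_base to x_alt,
    with hidden neuron `mediator` as the mediator.

    Effects are measured as output differences, an assumption made for this sketch.
    """
    h_base, h_alt = hidden(x_base), hidden(x_alt)
    y_base = output(x_base, h_base)

    # Total effect: change the input and let every component respond.
    total = output(x_alt, h_alt) - y_base

    # Indirect effect: keep the original input, but set the mediator
    # to the value it would take under the changed input.
    h_patched = list(h_base)
    h_patched[mediator] = h_alt[mediator]
    indirect = output(x_base, h_patched) - y_base

    # Direct effect: change the input, but hold the mediator at its original value.
    h_frozen = list(h_alt)
    h_frozen[mediator] = h_base[mediator]
    direct = output(x_alt, h_frozen) - y_base

    return total, direct, indirect


if __name__ == "__main__":
    total, direct, indirect = effects(x_base=0.0, x_alt=1.0, mediator=0)
    print(f"total={total:.4f} direct={direct:.4f} indirect={indirect:.4f}")
```

Because this toy output is linear in the hidden activations, the direct and indirect effects add up exactly to the total effect. That additivity is a property of the toy model and does not hold in general for nonlinear networks.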
