Adversarial Removal of Demographic Attributes Revisited

Elazar and Goldberg (2018) showed that protected attributes can be extracted from the representations of a debiased neural network for mention detection at above-chance levels, by evaluating a diagnostic classifier on a held-out subsample of the data it was trained on. We revisit their experiments and conduct a series of follow-up experiments showing that, in fact, the diagnostic classifier generalizes poorly to both new in-domain samples and new domains, indicating that it relies on correlations specific to their particular data sample. We further show that a diagnostic classifier trained on the biased baseline neural network also does not generalize to new samples. In other words, the biases detected in Elazar and Goldberg (2018) seem restricted to their particular data sample, and would therefore not bias the decisions of the model on new samples, whether in-domain or out-of-domain. In light of this, we discuss better methodologies for detecting bias in our models.
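To make the probing setup concrete, here is a minimal sketch in Python with scikit-learn. It is not the authors' code: the representation matrices and attribute labels below are random placeholders standing in for a debiased encoder's outputs, and all variable names are hypothetical. The sketch only illustrates the shape of the comparison: train a diagnostic classifier on one data sample, then score it both on a held-out split of that same sample (the setting of Elazar and Goldberg, 2018) and on a genuinely new sample.

```python
# Minimal sketch of the diagnostic-classifier protocol (not the authors' code).
# X_sample/y_sample stand in for representations and protected-attribute labels
# from the original data sample; X_new/y_new for a new in-domain or
# out-of-domain sample. Here both are random noise, so both scores will hover
# around chance; on real representations, the held-out score is the one that
# Elazar and Goldberg (2018) report as above chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_sample = rng.normal(size=(2000, 300))   # encoder representations (placeholder)
y_sample = rng.integers(0, 2, size=2000)  # binary protected attribute (placeholder)
X_new = rng.normal(size=(500, 300))       # representations from a new sample
y_new = rng.integers(0, 2, size=500)

# Train the diagnostic classifier (probe) on part of the original sample.
X_tr, X_heldout, y_tr, y_heldout = train_test_split(
    X_sample, y_sample, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("held-out (same sample):", probe.score(X_heldout, y_heldout))
print("new sample:            ", probe.score(X_new, y_new))
```

The gap between these two numbers is what the paper turns on: an above-chance score on the held-out split alone can reflect sample-specific correlations rather than attribute information that would generalize to new samples.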

[1] Robert Frank, et al. Open Sesame: Getting inside BERT's Linguistic Knowledge, 2019, BlackboxNLP@ACL.

[2] Timothy Baldwin, et al. Towards Robust and Privacy-preserving Text Representations, 2018, ACL.

[3] Katrina Ligett, et al. Penalizing Unfairness in Binary Classification, 2017.

[4] John D. Burger, et al. Discriminating Gender on Twitter, 2011, EMNLP.

[5] Yoav Goldberg, et al. Adversarial Removal of Demographic Attributes from Text Data, 2018, EMNLP.

[6] Benno Stein, et al. Overview of the 2nd Author Profiling Task at PAN 2014, 2014, CLEF.

[7] John Langford, et al. A Reductions Approach to Fair Classification, 2018, ICML.

[8] Francis M. Tyers, et al. Can LSTM Learn to Capture Agreement? The Case of Basque, 2018, BlackboxNLP@EMNLP.

[9] Benno Stein, et al. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations, 2016, CLEF.

[10] Edward Raff, et al. Gradient Reversal Against Discrimination, 2018, arXiv.

[11] Ryan Cotterell, et al. Gender Bias in Contextualized Word Embeddings, 2019, NAACL.

[12] Zhe Zhao, et al. Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations, 2017, arXiv.

[13] Graham Neubig, et al. Controllable Invariance through Adversarial Feature Learning, 2017, NIPS.

[14] Osbert Bastani, et al. Interpretability via Model Extraction, 2017, arXiv.

[15] Amir Globerson, et al. Nightmare at test time: robust learning by feature deletion, 2006, ICML.

[16] Lyle H. Ungar, et al. Analyzing Biases in Human Perception of User Age and Gender from Text, 2016, ACL.

[17] Victor S. Lempitsky, et al. Unsupervised Domain Adaptation by Backpropagation, 2014, ICML.

[18] Yashesh Gaur, et al. Robust Speech Recognition Using Generative Adversarial Networks, 2018, ICASSP.