BiasFinder: Metamorphic Test Generation to Uncover Bias for Sentiment Analysis Systems

Artificial Intelligence (AI) software systems, such as Sentiment Analysis (SA) systems, typically learn from large amounts of data that may reflect human biases. Consequently, the machine learning model in such a system may exhibit unintended demographic bias based on specific characteristics (e.g., gender, occupation, or country of origin). Such bias manifests in an SA system when it predicts different sentiments for similar texts that differ only in the characteristic of the individuals described. Existing studies on revealing bias in SA systems rely on producing sentences from a small set of short, predefined templates. To address this limitation, we present BiasFinder, an approach to discover biased predictions in SA systems via metamorphic testing. A key feature of BiasFinder is the automatic curation of suitable templates from pieces of text in a large corpus, using various Natural Language Processing (NLP) techniques to identify words that describe demographic characteristics. BiasFinder then instantiates new texts from these templates by filling in placeholders with words associated with a class of a characteristic (e.g., gender-specific words such as female names, “she”, “her”). These texts are used to tease out bias in an SA system. BiasFinder identifies a bias-uncovering test case when it detects that the SA system exhibits demographic bias for a pair of texts, i.e., it predicts different sentiments for texts that differ only in words associated with different classes (e.g., male vs. female) of a target characteristic (e.g., gender). Our empirical evaluation showed that BiasFinder can effectively create a large number of realistic and diverse test cases that uncover various biases in an SA system, with a high true positive rate of up to 95.8%.
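
The metamorphic relation described above can be illustrated with a minimal Python sketch. It is not BiasFinder's implementation: the template, the filler words, the placeholder syntax, and the function names (`instantiate`, `find_bias_uncovering_pairs`, `toy_biased_predict`) are all hypothetical, and the template is written by hand here, whereas BiasFinder derives templates automatically from a corpus. The sketch only shows the final step: fill one template with words from different classes of a characteristic, then flag every pair of texts for which the SA system predicts different sentiments.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical hand-written template; placeholders are marked as <key>.
TEMPLATE = "<name> said <subj> loved the film, and I agree with <obj>."

# Hypothetical filler words for two classes of the "gender" characteristic.
GENDER_FILLERS: Dict[str, List[Dict[str, str]]] = {
    "male": [
        {"name": "John", "subj": "he", "obj": "him"},
        {"name": "Mark", "subj": "he", "obj": "him"},
    ],
    "female": [
        {"name": "Mary", "subj": "she", "obj": "her"},
        {"name": "Anna", "subj": "she", "obj": "her"},
    ],
}


def instantiate(template: str, fillers: Dict[str, str]) -> str:
    """Replace every <key> placeholder in the template with its filler word."""
    text = template
    for key, word in fillers.items():
        text = text.replace(f"<{key}>", word)
    return text


def find_bias_uncovering_pairs(
    template: str,
    fillers_by_class: Dict[str, List[Dict[str, str]]],
    predict: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return pairs of texts from different classes with differing predictions."""
    texts = {
        cls: [instantiate(template, f) for f in fillers]
        for cls, fillers in fillers_by_class.items()
    }
    classes = sorted(texts)
    pairs = []
    for i, cls_a in enumerate(classes):
        for cls_b in classes[i + 1:]:
            for text_a, text_b in product(texts[cls_a], texts[cls_b]):
                # Metamorphic relation: the texts differ only in class-specific
                # words, so the predicted sentiment should be identical.
                if predict(text_a) != predict(text_b):
                    pairs.append((text_a, text_b))
    return pairs


def toy_biased_predict(text: str) -> str:
    """Deliberately biased stand-in for an SA system (assumption, not a real model)."""
    return "negative" if "she" in text.lower().split() else "positive"


if __name__ == "__main__":
    for a, b in find_bias_uncovering_pairs(TEMPLATE, GENDER_FILLERS, toy_biased_predict):
        print(f"{a!r} vs. {b!r}")
```

In practice, `predict` would wrap the SA system under test. A system that satisfies the relation for this template yields an empty list.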
