DynaSent: A Dynamic Benchmark for Sentiment Analysis

We introduce DynaSent (‘Dynamic Sentiment’), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using the open-source Dynabench Platform, which facilitates human-and-model-in-the-loop dataset creation. DynaSent has a total of 121,634 sentences, each validated by five crowdworkers, and its development and test splits are designed to produce chance performance for even the best models we have been able to develop; when future models solve this task, we will use them to create DynaSent version 2, continuing the dynamic evolution of this benchmark. Here, we report on the dataset creation effort, focusing on the steps we took to increase quality and reduce artifacts. We also present evidence that DynaSent’s Neutral category is more coherent than the comparable category in other benchmarks, and we motivate training models from scratch for each round over successive fine-tuning.
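
To make the validation setup concrete, the following is a minimal Python sketch of how a gold label might be distilled from the five crowdworker judgments each sentence receives. The 3-of-5 majority threshold and the function name are illustrative assumptions, not the paper's exact protocol; consult the dataset documentation for the precise aggregation rule.

    from collections import Counter

    LABELS = {"positive", "negative", "neutral"}

    def distill_label(votes):
        """Return the majority label from five crowdworker votes,
        or None when no label reaches a 3-of-5 majority.

        `votes` is a list of five strings drawn from LABELS.
        The 3-of-5 threshold is an assumption for illustration.
        """
        assert len(votes) == 5 and set(votes) <= LABELS
        label, count = Counter(votes).most_common(1)[0]
        return label if count >= 3 else None

    # Example: four of five workers agree, so the sentence
    # distills to a "negative" gold label.
    print(distill_label(["negative", "negative", "neutral",
                         "negative", "negative"]))

Sentences with no majority label can be held out or routed to the adversarial splits rather than silently discarded; that design choice is one reason a dynamic benchmark can keep its development and test splits hard for current models.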
