How Robust are Fact Checking Systems on Colloquial Claims?

Knowledge is now starting to power neural dialogue agents, and with it the risk of misinformation and disinformation from dialogue agents also rises. Verifying the veracity of information from formal sources is widely studied in computational fact checking. In this work, we ask: how robust are fact checking systems on claims written in a colloquial style? We aim to open a new discussion at the intersection of fact verification and dialogue safety. To investigate how fact checking systems behave on colloquial claims, we transfer the style of claims from FEVER (Thorne et al., 2018) into colloquial language. We find that existing fact checking systems that perform well on claims in formal style degrade significantly on colloquial claims with the same semantics. In particular, we show that document retrieval is the weakest link in the pipeline and is vulnerable even to filler words such as "yeah" and "you know": the document recall of the WikiAPI retriever (Hanselowski et al., 2018) drops from 90.0% on FEVER to 72.2% on the colloquial claims. Finally, we compare the characteristics of colloquial claims with those of formally styled claims and discuss the challenges they pose.
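To make the recall figures concrete, the following is a minimal sketch (not the authors' code) of how document recall can be compared between original claims and claims perturbed with filler words such as "yeah" and "you know". The retrieve_documents(claim, k) interface is an assumed stand-in for any FEVER-style document retriever, e.g., the WikiAPI retriever.

import random

# Hypothetical filler-word perturbation, a crude proxy for colloquial style transfer.
FILLERS = ["yeah", "you know", "well", "I mean"]

def add_filler_words(claim: str, rng: random.Random) -> str:
    """Insert one filler word at a random position in the claim."""
    tokens = claim.split()
    pos = rng.randrange(len(tokens) + 1)
    tokens.insert(pos, rng.choice(FILLERS))
    return " ".join(tokens)

def document_recall(claims, gold_pages, retrieve_documents, k=5):
    """Fraction of claims for which at least one gold evidence page appears in the top-k retrieved documents."""
    hits = 0
    for claim, gold in zip(claims, gold_pages):
        retrieved = set(retrieve_documents(claim, k=k))
        if retrieved & set(gold):
            hits += 1
    return hits / len(claims)

# Usage sketch (claims, gold_pages, retrieve_documents are assumed to be supplied):
# rng = random.Random(0)
# recall_formal = document_recall(claims, gold_pages, retrieve_documents)
# recall_colloquial = document_recall(
#     [add_filler_words(c, rng) for c in claims], gold_pages, retrieve_documents)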

[1] Jason Weston, et al. Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation, 2020, EMNLP.

[2] Jason Weston, et al. Reading Wikipedia to Answer Open-Domain Questions, 2017, ACL.

[3] Maosong Sun, et al. GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification, 2019, ACL.

[4] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[5] Jakob Grue Simonsen, et al. Generating Fact Checking Explanations, 2020, ACL.

[6] Suhang Wang, et al. Fake News Detection on Social Media: A Data Mining Perspective, 2017, SIGKDD Explorations.

[7] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[8] Dominik Stammbach, et al. Team DOMLIN: Exploiting Evidence Enhancement for the FEVER Shared Task, 2019, EMNLP.

[9] Peter Henderson, et al. Ethical Challenges in Data-Driven Dialogue Systems, 2017, AIES.

[10] Xiaodong Liu, et al. Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading, 2019, ACL.

[11] Yejin Choi, et al. WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale, 2020, AAAI.

[12] Arkaitz Zubiaga, et al. SemEval-2019 Task 7: RumourEval, Determining Rumour Veracity and Support for Rumours, 2019, SemEval.

[13] Xiaoyan Zhu, et al. Commonsense Knowledge Aware Conversation Generation with Graph Attention, 2018, IJCAI.

[14] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[15] Danqi Chen, et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.

[16] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, arXiv.

[17] Jason Weston, et al. Wizard of Wikipedia: Knowledge-Powered Conversational Agents, 2018, ICLR.

[18] Ming Zhou, et al. Reasoning Over Semantic-Level Graph for Fact Checking, 2020, ACL.

[19] Zhonghai Wu, et al. Diverse and Informative Dialogue Generation with Context-Specific Commonsense Knowledge Awareness, 2020, ACL.

[20] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.

[21] Keith W. Miller, et al. Why we should have seen that coming: comments on Microsoft's Tay "experiment," and wider implications, 2017, ACM SIGCAS Computers and Society.

[22] Preslav Nakov, et al. Integrating Stance Detection and Fact Checking in a Unified Corpus, 2018, NAACL.

[23] Jason Weston, et al. ParlAI: A Dialog Research Software Platform, 2017, EMNLP.

[24] Jianfeng Gao, et al. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation, 2020, ACL.

[25] Haonan Chen, et al. Combining Fact Extraction and Verification with Neural Semantic Matching Networks, 2018, AAAI.

[26] Lucas Dixon, et al. Ex Machina: Personal Attacks Seen at Scale, 2016, WWW.

[27] Christos Christodoulopoulos, et al. Evaluating Adversarial Attacks against Multiple Fact Verification Systems, 2019, EMNLP.

[28] Maria Janicka, et al. GEM: Generative Enhanced Model for Adversarial Attacks, 2019, EMNLP.

[29] Isabelle Augenstein, et al. Generating Label Cohesive and Well-Formed Adversarial Claims, 2020, EMNLP.

[30] Dmitry Ilvovsky, et al. Extract and Aggregate: A Novel Domain-Independent Approach to Factual Data Verification, 2019, EMNLP.

[31] Jason Weston, et al. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack, 2019, EMNLP.

[32] Mitesh M. Khapra, et al. Towards Exploiting Background Knowledge for Building Conversation Systems, 2018, EMNLP.

[33] Ronan Le Bras, et al. Adversarial Filters of Dataset Biases, 2020, ICML.

[34] Alan W. Black, et al. A Dataset for Document Grounded Conversations, 2018, EMNLP.

[35] Yejin Choi, et al. The Risk of Racial Bias in Hate Speech Detection, 2019, ACL.

[36] Iryna Gurevych, et al. UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification, 2018, FEVER@EMNLP.

[37] Christopher D. Manning, et al. Key-Value Retrieval Networks for Task-Oriented Dialogue, 2017, SIGDIAL.

[38] Wenhu Chen, et al. TabFact: A Large-scale Dataset for Table-based Fact Verification, 2019, ICLR.

[39] Thanh Tran, et al. HABERTOR: An Efficient and Effective Deep Hatespeech Detector, 2020, EMNLP.

[40] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[41] Christian Hansen, et al. MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims, 2019, EMNLP.

[42] Maosong Sun, et al. Coreferential Reasoning Learning for Language Representation, 2020, EMNLP.

[43] Luo Si, et al. Rumor Detection on Social Media: Datasets, Methods and Opportunities, 2019, EMNLP.

[44] Mary Williamson, et al. Recipes for Building an Open-Domain Chatbot, 2020, EACL.

[45] Andreas Vlachos, et al. FEVER: a Large-scale Dataset for Fact Extraction and VERification, 2018, NAACL.

[46] Rahul Goel, et al. Detecting Offensive Content in Open-domain Conversations using Two Stage Semi-supervision, 2018, arXiv.

[47] Christopher D. Manning, et al. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages, 2020, ACL.

[48] Shikha Bordia, et al. HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification, 2020, Findings of EMNLP.

[49] Hannaneh Hajishirzi, et al. Fact or Fiction: Verifying Scientific Claims, 2020, EMNLP.

[50] Jianfeng Gao, et al. A Persona-Based Neural Conversation Model, 2016, ACL.

[51] Ming-Wei Chang, et al. A Knowledge-Grounded Neural Conversation Model, 2017, AAAI.

[52] Fabio Petroni, et al. Generating Fact Checking Briefs, 2020, EMNLP.

[53] Teresa K. O'Leary, et al. Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant, 2018, Journal of Medical Internet Research.

[54] Wei Wu, et al. StyleDGPT: Stylized Response Generation with Pre-trained Language Models, 2020, Findings of EMNLP.

[55] Erik Cambria, et al. Augmenting End-to-End Dialogue Systems With Commonsense Knowledge, 2018, AAAI.