Adapting Open Domain Fact Extraction and Verification to COVID-FACT through In-Domain Language Modeling

With the COVID-19 epidemic, verifying scientifically false online information, such as fake news and maliciously fabricated statements, has become crucial. However, the lack of training data in the scientific domain limits the performance of fact verification models. This paper proposes an in-domain language modeling method for fact extraction and verification systems. We introduce SciKGAT, which combines the advantages of open-domain literature search, state-of-the-art fact verification systems, and in-domain medical knowledge through language modeling. Our experiments on SCIFACT, a dataset of expert-written scientific fact verification, show that SciKGAT achieves a 30% absolute improvement in precision. Our analyses show that this improvement stems from our in-domain language model, which retrieves more related evidence pieces and verifies facts more accurately. Our code and data are released on GitHub.
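
The in-domain language modeling step amounts to continuing masked-language-model pretraining on scientific, COVID-related text before fine-tuning the retrieval and verification components. Below is a minimal sketch of such continued pretraining with Hugging Face Transformers; the corpus file name, base checkpoint, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of in-domain masked-language-model (MLM) pretraining.
# Assumptions (not from the paper): RoBERTa-base as the starting
# checkpoint, a line-per-abstract text file "cord19_abstracts.txt" as
# the in-domain corpus, and placeholder hyperparameters.
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Hypothetical in-domain corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "cord19_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mlm-covid",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```

The adapted checkpoint would then replace the generic pretrained encoder when fine-tuning the evidence retrieval and claim verification models.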
