WikiCheck: An End-to-end Open Source Automatic Fact-Checking API based on Wikipedia

With the growth of fake news and disinformation, the NLP community has been working to assist humans in fact-checking. However, most academic research has focused on model accuracy without paying attention to resource efficiency, which is crucial in real-life scenarios. In this work, we review the state-of-the-art datasets and solutions for automatic fact-checking and test their applicability in production environments. We discover overfitting issues in those models and propose a data filtering method that improves the model's performance and generalization. We then design an unsupervised fine-tuning procedure for Masked Language Models to improve their accuracy when working with Wikipedia. We also propose a novel query-enhancement method that improves evidence discovery through the Wikipedia Search API. Finally, we present a new fact-checking system, the WikiCheck API, which automatically performs the fact-validation process against the Wikipedia knowledge base. It is comparable to SOTA solutions in terms of accuracy and can be used on low-memory CPU instances.
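The abstract outlines a two-stage pipeline: candidate evidence is first retrieved from Wikipedia via the Wikipedia Search API, and a Natural Language Inference (NLI) model then decides whether the retrieved sentences support or refute the claim. The sketch below illustrates that general flow in Python under stated assumptions: the MediaWiki search and extract endpoints are real, but the NLI model (roberta-large-mnli), the sentence-splitting heuristic, and the aggregation rule are illustrative choices, not the paper's exact WikiCheck implementation (which also includes query enhancement and data filtering not reproduced here).

# Minimal sketch of a Wikipedia-based claim-verification pipeline.
# Assumptions: roberta-large-mnli as the NLI model and a naive
# sentence-level aggregation rule; WikiCheck's actual models and
# query-enhancement steps are not reproduced here.
import requests
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

WIKI_API = "https://en.wikipedia.org/w/api.php"
NLI_MODEL = "roberta-large-mnli"  # assumption: any claim/evidence NLI model could be used

tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)
model.eval()


def search_wikipedia(query: str, limit: int = 3) -> list[str]:
    """Retrieve candidate page titles for a claim via the MediaWiki search API."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    response = requests.get(WIKI_API, params=params, timeout=10)
    response.raise_for_status()
    return [hit["title"] for hit in response.json()["query"]["search"]]


def get_intro_sentences(title: str) -> list[str]:
    """Fetch the plain-text introduction of a page and split it into rough sentences."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": True,
        "explaintext": True,
        "titles": title,
        "format": "json",
    }
    response = requests.get(WIKI_API, params=params, timeout=10)
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    text = next(iter(pages.values())).get("extract", "")
    return [s.strip() for s in text.split(". ") if s.strip()]


def nli_label(evidence: str, claim: str) -> tuple[str, float]:
    """Classify the (evidence, claim) pair as CONTRADICTION / NEUTRAL / ENTAILMENT."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    idx = int(probs.argmax())
    return model.config.id2label[idx], float(probs[idx])


def check_claim(claim: str) -> list[tuple[str, str, float]]:
    """Return per-sentence NLI verdicts for the top Wikipedia search hits."""
    verdicts = []
    for title in search_wikipedia(claim):
        for sentence in get_intro_sentences(title):
            label, score = nli_label(sentence, claim)
            if label != "NEUTRAL":
                verdicts.append((sentence, label, score))
    return verdicts


if __name__ == "__main__":
    for sentence, label, score in check_claim("The Eiffel Tower is located in Berlin."):
        print(f"{label:>13} ({score:.2f}): {sentence[:80]}")

In this hypothetical setup, a claim such as "The Eiffel Tower is located in Berlin." would retrieve the Eiffel Tower article and the NLI model would flag the introductory sentences as contradicting evidence; the production WikiCheck API additionally relies on the fine-tuned and filtered models described above to keep the whole pipeline usable on low-memory CPU instances.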
