Machine Generation and Detection of Arabic Manipulated and Fake News

Fake news and deceptive machine-generated text are serious problems threatening modern societies, including in the Arab world. This motivates work on detecting false and manipulated stories online. However, a bottleneck for this research is the lack of sufficient data to train detection models. We present a novel method for automatically generating Arabic manipulated (and potentially fake) news stories. Our method is simple and depends only on the availability of true stories, which are abundant online, and a part-of-speech (POS) tagger. To facilitate future work, we remove the need to collect and tag such stories by providing AraNews, a novel and large POS-tagged news dataset that can be used off-the-shelf. Using stories generated from AraNews, we carry out a human annotation study that sheds light on the effects of machine manipulation on text veracity. The study also measures the human ability to detect Arabic machine-manipulated text generated by our method. Finally, we develop the first models for detecting manipulated Arabic news and achieve state-of-the-art results on Arabic fake news detection (macro F1 = 70.06). Our models and data are publicly available.
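
For readers unfamiliar with the reported metric, the following is a minimal sketch of how macro F1 is computed: the per-class F1 scores are averaged with equal weight, so a rare class counts as much as a frequent one. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper or its released code.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall (0 if both are 0).
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
gold = ["true", "true", "fake", "fake"]
pred = ["true", "fake", "fake", "fake"]
score = macro_f1(gold, pred)
print(round(100 * score, 2))  # scaled to 0-100, as scores of this kind are often reported
```

Under this reading, the reported 70.06 would correspond to a macro F1 of about 0.70 on a 0 to 1 scale.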
