An Arabic Corpus of Fake News: Collection, Analysis and Classification

In recent years, with the explosive growth of social media, large volumes of rumors have spread rapidly on the internet. The proliferation of malicious misinformation and rumors on social media can harm both individuals and society. In this paper, we investigate the content of fake news in the Arab world through information posted on YouTube. Our contribution is threefold. First, we introduce a novel Arabic corpus for the task of fake news analysis, covering the topics most affected by rumors. We describe the corpus and the data collection process in detail. Second, we present several exploratory analyses of the harvested data to extract useful knowledge about how rumors are transmitted for the studied topics. Third, we test whether rumor and non-rumor comments can be discriminated using three machine learning classifiers: Support Vector Machine (SVM), Decision Tree (DT) and Multinomial Naive Bayes (MNB).
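To make the classification task concrete, the following is a minimal sketch of one of the three classifiers named above, Multinomial Naive Bayes, applied to rumor versus non-rumor comments. It is an illustration under stated assumptions, not the implementation used in this work: the whitespace tokenization, the Laplace smoothing value, the function names `train_mnb` and `predict_mnb`, and the toy English comments are all hypothetical and do not come from the corpus.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized comments.

    alpha is the Laplace smoothing constant (an assumed default).
    """
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.split()
        token_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = len(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "token_counts": token_counts,
        # log P(class), estimated from class frequencies
        "log_prior": {c: math.log(n / total_docs) for c, n in class_counts.items()},
        # total number of tokens observed in each class
        "totals": {c: sum(token_counts[c].values()) for c in class_counts},
    }


def predict_mnb(model, text):
    """Return the class with the highest posterior log-score for one comment."""
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for token in text.split():
            if token not in model["vocab"]:
                continue  # ignore tokens never seen during training
            count = model["token_counts"][label][token]
            score += math.log((count + alpha) / (model["totals"][label] + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy comments, invented only for illustration.
toy_docs = [
    "the president died yesterday share before deleted",
    "breaking secret death confirmed share now",
    "nice song thank you",
    "great video thank you for sharing",
]
toy_labels = ["rumor", "rumor", "non-rumor", "non-rumor"]

model = train_mnb(toy_docs, toy_labels)
print(predict_mnb(model, "share secret death"))
print(predict_mnb(model, "thank you nice video"))
```

In practice, Arabic comments would likely need normalization and a proper tokenizer before counting, and the SVM and Decision Tree classifiers would be trained on the same features for comparison.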
