Exploring N-gram, Word Embedding and Topic Models for Content-based Fake News Detection in FakeNewsNet Evaluation

FakeNewsNet is a repository of two novel datasets, PolitiFact and GossipCop, which are used to evaluate fake news detection techniques. Unlike other extensively studied benchmark fake news datasets, the FakeNewsNet datasets incorporate news content, social context, and dynamic information, which can be used to study fake news propagation, detection, and mitigation. Existing works on FakeNewsNet have focused on one-hot encoding, social context such as user-based models, and dynamic information such as news propagation models. However, n-gram, word embedding, and topic models of news content, which have performed well in other contexts, have not been explored. This paper therefore explores n-gram, word embedding, and topic models of news content for the evaluation of the FakeNewsNet datasets. A unigram-based n-gram model, a skip-gram word2vec-based word embedding model, and a Latent Dirichlet Allocation-based topic model are extracted after preprocessing the datasets. The features are weighted by TF-IDF to overcome the shortcomings of the individual models and analyzed using Logistic Regression. The evaluation of the models and their hybrids shows that the n-gram model outperforms the word embedding and topic models. Specifically, the n-gram model records accuracy, precision, recall, and F1-score of 0.80, 0.79, 0.78, and 0.79, respectively, for PolitiFact and 0.82, 0.75, 0.79, and 0.77, respectively, for GossipCop. Comparison with benchmarks also shows that the n-gram model performs better.

General Terms: Machine Learning, Computational Linguistics
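As an illustration of the content-based pipeline summarized above, the following is a minimal sketch of unigram features weighted by TF-IDF and classified with logistic regression, followed by the four reported metrics. It is not the authors' implementation: it uses only the Python standard library, the smoothed IDF formula and the gradient-descent settings (`epochs`, `lr`) are assumptions, and the four headlines are invented toy data rather than FakeNewsNet samples. The word2vec and Latent Dirichlet Allocation features are not shown.

```python
import math
import re
from collections import Counter


def unigrams(text):
    """Lowercase and keep alphanumeric tokens (a simple stand-in for preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every unigram in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(unigrams(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; tokens unseen during fitting are dropped."""
    tf = Counter(tok for tok in unigrams(doc) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decision(vec, weights, bias):
    """Linear score of a sparse vector under the current model."""
    return bias + sum(weights.get(tok, 0.0) * v for tok, v in vec.items())


def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Logistic regression fitted by plain stochastic gradient descent (no regularisation)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = sigmoid(decision(vec, weights, bias)) - y
            bias -= lr * err
            for tok, v in vec.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * v
    return weights, bias


def predict(vec, weights, bias):
    """Return 1 (fake) when the predicted probability is at least 0.5, else 0 (real)."""
    return int(sigmoid(decision(vec, weights, bias)) >= 0.5)


def scores(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for the positive (fake) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Invented toy headlines (hypothetical, NOT taken from FakeNewsNet); 1 = fake, 0 = real.
train_docs = [
    "shocking secret cure doctors hate",
    "celebrity secret scandal shocking photos",
    "senate passes budget bill after vote",
    "city council approves new budget",
]
train_labels = [1, 1, 0, 0]

idf = fit_idf(train_docs)
train_vectors = [tfidf(doc, idf) for doc in train_docs]
weights, bias = train_logreg(train_vectors, train_labels)
train_predictions = [predict(vec, weights, bias) for vec in train_vectors]
```

Calling `predict(tfidf(text, idf), weights, bias)` classifies a new text, and `scores(train_labels, train_predictions)` returns the metrics in the order used in the abstract. On a real dataset the metrics would be computed on a held-out split rather than on the training data, and a library implementation would normally replace the hand-written optimizer.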
