State of the Art Models for Fake News Detection Tasks

This paper presents state-of-the-art methods for three important tasks in automated fake news detection: detecting fake news articles, identifying the news domain, and identifying bots from tweets. The proposed solutions achieved first place in a recent international competition on fake news. For fake news detection, we present two models. The winning model in the competition is based on the similarity between the embedding of each article's title and the embeddings of the top five corresponding Google search results. The second model relies on advances in Natural Language Understanding (NLU) end-to-end deep learning models to identify stylistic differences between legitimate and fake news articles; it was developed after the competition and outperforms the winning approach. For news domain detection, the winning model is a hybrid approach that concatenates named entity features with semantic embeddings derived from end-to-end models. For Twitter bot detection, we propose the following features: the duration between account creation and tweet date, the presence of a link in the tweet, the presence of the user's location, other tweet features, and the tweets' metadata. Experiments provide insights into the importance of the different features, and the results indicate the superior performance of all proposed models.
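
As an illustration of the title-to-search-result similarity idea, the following minimal sketch computes one similarity score per search result. It is not the competition implementation: the use of cosine similarity, the zero-padding when fewer than five results are available, and the function names `cosine_similarity` and `title_result_similarities` are all assumptions made for this example, and the embeddings are assumed to be precomputed by some sentence-embedding model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def title_result_similarities(
    title_vec: Sequence[float],
    result_vecs: Sequence[Sequence[float]],
    top_k: int = 5,
) -> List[float]:
    """Return one similarity per search result, in rank order.

    Assumption: if fewer than top_k results exist, the feature vector
    is padded with 0.0 so that it always has length top_k.
    """
    sims = [cosine_similarity(title_vec, r) for r in result_vecs[:top_k]]
    return sims + [0.0] * (top_k - len(sims))


# Toy 2-dimensional embeddings, purely for illustration.
title = [1.0, 0.0]
results = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
features = title_result_similarities(title, results)
```

The resulting fixed-length vector could then be passed to any standard classifier; which classifier the winning model used is not stated in the abstract.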
