Weakly Supervised Learning for Fake News Detection on Twitter

The automatic detection of fake news in social media, e.g., on Twitter, has recently drawn attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is collecting sufficiently large training corpora, since manually annotating tweets as fake or non-fake news is expensive and tedious. In this paper, we discuss a weakly supervised approach that automatically collects a large-scale but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label each tweet by its source, i.e., as coming from a trustworthy or an untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., distinguishing fake from non-fake tweets. The labels are not accurate with respect to this new target (not all tweets from an untrustworthy source need to be fake news, and vice versa). Nevertheless, we show that despite this noisy, inaccurately labeled dataset, it is possible to detect fake news with an F1 score of up to 0.9.
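The weak-labeling idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the source names, the toy tweets, and the choice of a small multinomial naive Bayes classifier are hypothetical and are not taken from the paper, which does not name its classifier or its source lists in this abstract. The sketch only shows the general pattern: label tweets by source, train on those noisy labels, then score predictions against the actual fake/non-fake target with F1.

```python
import math
from collections import Counter

# Hypothetical source lists (assumption; not the paper's actual lists).
TRUSTED = {"reliable_daily"}
UNTRUSTED = {"hoax_wire"}

def weak_label(source):
    """Weak label from the source alone: 1 = untrustworthy, 0 = trustworthy."""
    if source in UNTRUSTED:
        return 1
    if source in TRUSTED:
        return 0
    return None  # unknown sources are left unlabeled

def tokenize(text):
    return text.lower().split()

def train_nb(texts, labels):
    """Fit a multinomial naive Bayes model (a stand-in classifier)."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for text, y in zip(texts, labels):
        counts[y].update(tokenize(text))
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def predict_nb(model, text):
    """Return the class with the higher add-one-smoothed log score."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for y in (0, 1):
        total = sum(counts[y].values())
        score = math.log(docs[y] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((counts[y][tok] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy collection of (source, tweet text) pairs, invented for illustration.
collected = [
    ("hoax_wire", "shocking secret cure they hide"),
    ("hoax_wire", "shocking miracle cure exposed"),
    ("reliable_daily", "parliament passes budget bill"),
    ("reliable_daily", "city council approves budget"),
]
texts = [text for _, text in collected]
labels = [weak_label(source) for source, _ in collected]
model = train_nb(texts, labels)

# Evaluation uses labels for the real target (fake vs. non-fake),
# which would come from a small manually annotated set.
gold_texts = ["shocking cure", "budget bill"]
gold_labels = [1, 0]
predictions = [predict_nb(model, t) for t in gold_texts]
score = f1_score(gold_labels, predictions)
```

The point of the design is that the expensive manual labels are needed only for the small evaluation set, while the training labels come for free from the source of each tweet.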
