COVID-19 and Misinformation: A Large-Scale Lexical Analysis on Twitter

Social media is often used by individuals and organisations as a platform to spread misinformation. The recent coronavirus pandemic has brought a surge of misinformation on Twitter, posing a danger to public health. In this paper, we compile a large COVID-19 Twitter misinformation corpus and analyse it to discover patterns in vocabulary usage. Among other findings, our analysis reveals that topics and vocabulary are considerably more limited in variety, and more negative, in tweets related to misinformation than in randomly extracted tweets. In addition to our qualitative analysis, our experimental results show that a simple linear model based only on lexical features is effective in identifying misinformation-related tweets (with accuracy over 80%), providing evidence that the vocabulary used in misinformation differs substantially from that of generic tweets.
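To make the idea of a linear model over purely lexical features concrete, the following is a minimal sketch, not the model used in this paper. The abstract does not specify the classifier or the feature set, so a plain perceptron over unigram counts is swapped in here as an assumption. The function names and the four toy tweets are hypothetical and invented for illustration only; they are not drawn from the corpus.

```python
from collections import Counter

# Toy, invented examples (NOT from the paper's corpus).
# Label +1 = misinformation-related, -1 = generic tweet.
TOY_TWEETS = [
    ("the virus is a hoax planned by elites", 1),
    ("5g towers spread the virus", 1),
    ("great football match tonight", -1),
    ("new cafe opened near the station", -1),
]

def featurize(text):
    """Bag-of-words unigram counts: the only features the model sees."""
    return Counter(text.lower().split())

def score(weights, bias, text):
    """Linear score: bias plus the weighted sum of word counts."""
    return bias + sum(weights[w] * c for w, c in featurize(text).items())

def predict(weights, bias, text):
    return 1 if score(weights, bias, text) > 0 else -1

def train_perceptron(examples, epochs=10):
    """Standard perceptron updates: adjust weights only on mistakes."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            if predict(weights, bias, text) != label:
                for word, count in featurize(text).items():
                    weights[word] += label * count
                bias += label
    return weights, bias

weights, bias = train_perceptron(TOY_TWEETS)
for text, label in TOY_TWEETS:
    print(label, predict(weights, bias, text), text)
```

Because the learned weights attach directly to individual words, inspecting the largest positive and negative entries of `weights` gives a rough view of which vocabulary items separate the two classes, which is the kind of evidence the abstract appeals to.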
