Deriving Disinformation Insights from Geolocalized Twitter Callouts

This paper demonstrates a two-stage method for deriving insights from social media data relating to disinformation by combining geospatial classification with embedding-based language modelling across multiple languages. In particular, the analysis is centered on Twitter and disinformation in three European languages: English, French and Spanish. First, Twitter data is classified into European and non-European sets using BERT. Second, Word2vec is applied to the classified texts, resulting in Eurocentric, non-Eurocentric and global representations of the data for the three target languages. This comparative analysis not only demonstrates the efficacy of the classification method but also highlights geographic, temporal and linguistic differences in the disinformation-related media. Thus, the contributions of the work are threefold: (i) a novel language-independent transformer-based geolocation method; (ii) an analytical approach that exploits lexical specificity and word embeddings to interrogate user-generated content; and (iii) a dataset of 36 million disinformation-related tweets in English, French and Spanish.
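To make the notion of lexical specificity in contribution (ii) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common hypergeometric formulation, in which a word's specificity to a subcorpus is the negative base-10 logarithm of the probability of seeing it at least as often as observed by chance. The function names `lexical_specificity` and `rank_specific_terms` and the toy token lists are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import comb, inf, log10


def lexical_specificity(T, t, F, f):
    """Return -log10 P(X >= f) for X ~ Hypergeometric(T, F, t).

    T: tokens in the full corpus, t: tokens in the subcorpus,
    F: occurrences of the word in the full corpus,
    f: occurrences of the word in the subcorpus.
    """
    total = comb(T, t)
    # Sum the upper tail: chance of drawing at least f occurrences.
    tail = sum(comb(F, k) * comb(T - F, t - k)
               for k in range(f, min(F, t) + 1))
    p = tail / total
    # Guard against floating-point underflow on very rare events.
    return inf if p == 0 else -log10(p)


def rank_specific_terms(subcorpus_tokens, rest_tokens, top_n=3):
    """Rank subcorpus words by specificity against the combined corpus."""
    sub = Counter(subcorpus_tokens)
    full = sub + Counter(rest_tokens)
    T, t = sum(full.values()), sum(sub.values())
    scores = {w: lexical_specificity(T, t, full[w], sub[w]) for w in sub}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy stand-ins for the European and non-European tweet sets (invented data).
european = "brexit hoax eu vaccine hoax brexit".split()
non_european = "election hoax vaccine fraud election vaccine".split()
top_terms = rank_specific_terms(european, non_european)
```

In a pipeline like the one described, the two token lists would come from the output of the geolocation classifier, and the highest-ranked terms could then be used as probes into the corresponding word-embedding space.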
