TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels

As the world struggles with several compounded challenges caused by the COVID-19 pandemic in the health, economic, and social domains, timely access to disaggregated national and sub-national data is important for understanding the emergent situation, yet such data is difficult to obtain. The widespread use of social networking sites, especially during mass convergence events such as health emergencies, provides instant access to citizen-generated data that offers rich information about public opinions, sentiments, and situational updates, from which authorities can gain insights. We offer a large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages. We used state-of-the-art machine learning models to enrich the data with sentiment labels and named entities. Additionally, we propose a gender identification approach to label users by gender. Furthermore, we devise a geolocalization approach to geotag tweets at country, state, county, and city granularities, enabling a wide range of data analysis tasks for understanding real-world issues at national and sub-national levels. We believe this multilingual data, with its broader geographical and longer temporal coverage, will be a cornerstone for researchers to study the impacts of the ongoing global health catastrophe and to manage adverse consequences related to people’s health, livelihood, and social well-being.
