Analysis of Geotagging Behavior: Do Geotagged Users Represent the Twitter Population?

Twitter’s APIs are now a main data source for social media researchers, and many studies have used Twitter data across diverse research areas. Twitter users can share their precise real-time location, which the Twitter APIs return as longitude and latitude. Such geotagged data can support the study of human activities and movements in a range of applications. Unlike the mostly small-scale samples common in fields such as social science, geotagged data can be collected in large samples. A fundamental question, however, is whether geotagged users can represent non-geotagged users. Prior studies have examined this question from different perspectives, but they did not compare the profile information and tweet content of geotagged and non-geotagged users. This empirical study addresses that limitation by applying text mining, statistical analysis, and machine learning techniques to Twitter data comprising more than 88,000 users and over 170 million tweets. Our findings show a significant difference (p-value < 0.001) between geotagged and non-geotagged users on 73% of the features obtained from the users’ profiles and tweets. These features can also distinguish geotagged from non-geotagged users with around 80% accuracy. The results indicate that geotagged users do not represent the Twitter population.
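The abstract does not say which statistical tests were used, so the following is only an illustrative sketch of how a "share of features that differ significantly" figure could be computed. It assumes a two-sided permutation test on the difference in means and a Benjamini-Hochberg correction across features. Both choices, and all function names, are hypothetical and are not taken from the study.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means (assumed test)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected under BH."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected


def share_of_differing_features(group_a, group_b, alpha=0.05):
    """group_a and group_b map a feature name to a list of per-user values."""
    names = sorted(group_a)
    p_values = [permutation_p_value(group_a[n], group_b[n]) for n in names]
    flags = benjamini_hochberg(p_values, alpha)
    return sum(flags) / len(names), dict(zip(names, flags))


# Synthetic toy data, not data from the study.
geotagged = {"x": [10 + 0.1 * i for i in range(20)], "y": list(range(20))}
non_geotagged = {"x": [0.1 * i for i in range(20)], "y": list(range(20))}
share, flags = share_of_differing_features(geotagged, non_geotagged)
print(share, flags)
```

In this toy input, feature `x` is shifted between the two groups and feature `y` is identical, so only `x` should be flagged. With a stricter threshold such as the 0.001 level reported above, `n_perm` would need to be much larger, because the smallest p-value a permutation test can return is 1 / (n_perm + 1).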
