Evaluating the Representativeness of Socio-Demographic Variables over Time for Geo-Social Media Data

Geo-social media data are widely used as a data source to model populations and processes in a variety of contexts. However, if the data do not adequately represent the population they are drawn from, analysis results will be biased, and unaddressed biases may lead to false interpretations and conclusions. In this paper, we propose a generic methodology for investigating the representativeness of geo-social media data, based on reference data, for population groups of similar statistical predictive power. The groups are designed to be spatially coherent regions with similar prediction errors. Based on these units, we investigate the influence of different socio-demographic covariates on the representativeness. We perform experiments based on over 1.6 billion tweets and 90 socio-demographic covariates. We demonstrate that the representativeness of Twitter data varies strongly over time and space. Our results show that densely populated areas tend to be consistently underrepresented in non-spatial models. Over time, some covariates, such as the number of people aged 20 years, exhibit highly variable effects on the prediction models, whereas others are much more stable. The spatial effects can most frequently be explained using spatial error models, which points to spatially related errors and thus to the need for additional covariates. Finally, we provide guidance on interpreting the results of our approach for researchers who apply the concepts presented in this paper.
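
The idea of spatially related errors in a non-spatial model can be illustrated with a minimal sketch. It is not the pipeline used in this paper: the six regions, their adjacency, the single covariate, and the tweet counts are all made-up toy values, and the function names `ols_residuals` and `morans_i` are hypothetical helpers. The sketch fits an ordinary least squares model of tweet counts on one covariate and then computes global Moran's I on the residuals, a common diagnostic whose positive values suggest that neighbouring regions share similar prediction errors.

```python
from statistics import mean


def ols_residuals(x, y):
    """Fit y = a + b*x by ordinary least squares and return the residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def morans_i(values, neighbours):
    """Global Moran's I with binary weights.

    neighbours[i] lists the indices of the regions adjacent to region i.
    """
    n = len(values)
    v_bar = mean(values)
    dev = [v - v_bar for v in values]
    w_sum = sum(len(nb) for nb in neighbours)  # total weight of all links
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbours[i])
    denom = sum(d * d for d in dev)
    return (n / w_sum) * (cross / denom)


# Toy data (assumed, for illustration only): six regions arranged in a chain.
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
covariate = [30, 10, 50, 20, 60, 40]        # e.g. residents in one age group
tweets = [63, 23, 103, 37, 117, 77]         # observed tweet counts per region

residuals = ols_residuals(covariate, tweets)
i_stat = morans_i(residuals, neighbours)

# Positive residuals: more tweets than the non-spatial model predicts;
# negative residuals: fewer tweets than predicted.
print("residuals:", [round(r, 2) for r in residuals])
print("Moran's I of residuals:", round(i_stat, 3))
```

In this toy setup the first three regions are over-predicted in the opposite direction to the last three, so the residuals cluster along the chain and Moran's I comes out positive. On real data, a result like this would be a reason to consider a spatial model or further covariates rather than a conclusion in itself.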
