Addressing Selection Bias in Event Studies with General-Purpose Social Media Panels

Twitter data have been used in prior research to study the impact of events. Conventionally, researchers draw keyword-based samples of tweets to build a panel of Twitter users who mention event-related keywords during and after an event. However, keyword-based sampling falls short on the objectivity dimension of data and information quality. First, it suffers from selection bias, because users who discuss an event are already more likely to have discussed event-related topics beforehand. Second, a keyword-based sample of Twitter users has no viable control group for comparison. We propose an alternative sampling approach that constructs panels of users defined by their geolocation. Geolocated panels are exogenous to the keywords in users’ tweets, resulting in less selection bias than keyword panels. They also allow us to follow within-person changes over time and to create comparison groups. We compare the different panels in two real-world settings: responses to mass shootings and to TV advertising. We first show the strength of the selection bias in keyword panels. Then, we empirically illustrate how geolocated panels reduce selection bias and allow meaningful comparison groups for assessing the impact of the studied events. We are the first to provide a clear, empirical example of how a better panel selection design, based on an exogenous variable such as geography, both reduces selection bias relative to the current state of the art and increases the value of Twitter research for studying events. While we advocate the use of geolocated panels, we also discuss their weaknesses and the scenarios in which they apply. This article also calls attention to how selection bias affects the objectivity of social media data.
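
The contrast between the two panel designs can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Tweet` record, the region labels "near" and "far", the keyword list, and all tweets are invented, and the difference-in-differences comparison is one plausible way to use a geographic comparison group rather than the estimator used in the article.

```python
from collections import namedtuple

# Hypothetical record; day < 0 means the tweet was posted before the event.
Tweet = namedtuple("Tweet", "user region day text")

# Invented toy data, for illustration only.
TWEETS = [
    Tweet("a", "near", -2, "gun laws again"),
    Tweet("a", "near", 1, "another shooting, gun control now"),
    Tweet("b", "near", -1, "coffee time"),
    Tweet("b", "near", 2, "shocked by the shooting"),
    Tweet("f", "near", -1, "rainy day"),
    Tweet("f", "near", 1, "still raining"),
    Tweet("c", "far", -3, "great game"),
    Tweet("c", "far", 1, "lunch"),
    Tweet("d", "far", -1, "busy week"),
    Tweet("d", "far", 3, "traffic is awful"),
    Tweet("e", "far", -2, "gun rights rally"),
    Tweet("e", "far", 1, "thoughts on the shooting"),
]
KEYWORDS = ("gun", "shooting")  # assumed event-related keywords


def mentions(tweet, keywords=KEYWORDS):
    """True if the tweet text contains any event-related keyword."""
    text = tweet.text.lower()
    return any(k in text for k in keywords)


def keyword_panel(tweets, keywords=KEYWORDS):
    """Users who mention a keyword during or after the event (day >= 0)."""
    return {t.user for t in tweets if t.day >= 0 and mentions(t, keywords)}


def geo_panel(tweets, region):
    """Users selected only by location, regardless of what they tweet."""
    return {t.user for t in tweets if t.region == region}


def mention_rate(tweets, users, after):
    """Share of the panel's tweets in one period that mention a keyword."""
    period = [t for t in tweets if t.user in users and (t.day >= 0) == after]
    return sum(mentions(t) for t in period) / len(period) if period else 0.0


def change(tweets, users):
    """Within-panel change in mention rate from before to after the event."""
    return mention_rate(tweets, users, True) - mention_rate(tweets, users, False)


everyone = {t.user for t in TWEETS}
kw = keyword_panel(TWEETS)
treated = geo_panel(TWEETS, "near")   # users close to the event
control = geo_panel(TWEETS, "far")    # comparison group far from the event

# Selection bias check: pre-event mention rate of the keyword panel vs. all users.
kw_baseline = mention_rate(TWEETS, kw, after=False)
all_baseline = mention_rate(TWEETS, everyone, after=False)

# Difference-in-differences using the geographic comparison group.
did = change(TWEETS, treated) - change(TWEETS, control)

print(sorted(kw), kw_baseline, all_baseline, did)
```

In this toy data the keyword panel already mentions the keywords in 2 of its 3 pre-event tweets, against 2 of 6 for all users, which is the kind of pre-existing difference the abstract describes as selection bias. The geolocated panels are chosen without looking at tweet content, so the "far" group can serve as a comparison for the "near" group.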
