A Tutorial for Using Twitter Data in the Social Sciences: Data Collection, Preparation, and Analysis

The ever-increasing use of digital tools and services has created new data sources for social scientists: data that users produce, wittingly or unwittingly, while interacting with these tools. The potential of such digital trace data is well established. In practice, however, data collection, preparation and storage, and subsequent analysis can pose challenges. This tutorial offers social scientists a guide to collecting, preparing, and analyzing digital trace data from the microblogging service Twitter. It comes with a set of scripts that serve as a starter kit, allowing researchers to search, collect, and prepare Twitter data according to their specific research interests. We start with a general discussion of the research process with Twitter data. We then introduce a set of scripts for data collection on Twitter, followed by scripts for preparing the data for analysis. Next, we present a series of examples of typical analyses that can be run with Twitter data, focusing on counts, time series, and networks. We close with a discussion of the challenges in establishing digital trace data as a normal data source in the social sciences.
