Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter

We present Storywrangler, an interactive cultural exploratorium of phrase popularity using 100 billion tweets in 100 languages. In real time, Twitter strongly imprints world events, popular culture, and the day-to-day, recording an ever-growing compendium of language change. Vitally, and absent from many standard corpora such as books and news archives, Twitter also encodes popularity and spreading through retweets. Here, we describe Storywrangler, an ongoing curation of over 100 billion tweets containing 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into 1-, 2-, and 3-grams across 100+ languages, generating frequencies for words, hashtags, handles, numerals, symbols, and emojis. We make the dataset available through an interactive time series viewer and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of tracking dynamic changes in n-grams can be extended to any temporally evolving corpus. Illustrating the instrument’s potential, we present example use cases including social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest.
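The daily n-gram tallies described above can be illustrated with a minimal sketch. It is not the Storywrangler pipeline: the lowercasing, the whitespace tokenizer, the `(day, text)` input format, and every function name here are assumptions made for illustration, and a real system would also need language identification and careful handling of emojis, hashtags, and handles.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_ngram_counts(tweets, n):
    """Tally n-gram counts per day.

    `tweets` is an iterable of (day, text) pairs; this format is hypothetical.
    Tokenization is a plain lowercase whitespace split, which is an assumption.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        tokens = text.lower().split()
        counts[day].update(ngrams(tokens, n))
    return counts


def relative_frequencies(counter):
    """Convert one day's raw counts into relative frequencies."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counter.items()}


# Toy input, invented purely for this example.
sample = [
    ("2020-03-11", "stay home stay safe"),
    ("2020-03-11", "stay home"),
    ("2020-03-12", "wash your hands"),
]

counts_1 = daily_ngram_counts(sample, 1)
counts_2 = daily_ngram_counts(sample, 2)
freq_1 = relative_frequencies(counts_1["2020-03-11"])
```

Keeping one counter per day means each day's distribution can be ranked or normalized on its own, and following a single phrase across days yields its time series.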
