Race, Religion and the City: Twitter Word Frequency Patterns Reveal Dominant Demographic Dimensions in the United States

Recently, numerous approaches have emerged in the social sciences to exploit the opportunities made possible by the vast amounts of data generated by online social networks (OSNs). Access to information about users on such a scale opens up a range of possibilities without the limitations of paper-based polls, which are often slow and expensive. A question that remains to be satisfactorily addressed, however, is how demography is represented in OSN content. Here, we study language use in the US using a corpus of text compiled from over half a billion geo-tagged messages from the online microblogging platform Twitter. Our intention is to reveal the most important spatial patterns in language use in an unsupervised manner and to relate them to demographics. Our approach is based on Latent Semantic Analysis (LSA) augmented with the Robust Principal Component Analysis (RPCA) methodology. We find spatially correlated patterns that can be interpreted based on the words associated with them. The main language features can be related to slang use, urbanization, travel, religion and ethnicity, and their spatial patterns are shown to correlate plausibly with traditional census data. Our findings thus validate the concept that demography is represented in OSN language use, and show that the observed traits are inherently present in the word frequencies without any prior assumptions about the dataset. They could therefore form the basis of further research on estimating demographic data from other big data sources, or on the dynamical processes that give rise to the patterns found here.
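
To make the combination of LSA and RPCA more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a region-by-word matrix `M` (here filled with synthetic numbers, not Twitter data), splits it into a low-rank part `L` and a sparse outlier part `S` using principal component pursuit solved with a simple augmented Lagrangian iteration, and then takes a truncated SVD of `L` as the LSA step. The function names, the matrix sizes, the default parameters and the absence of any frequency weighting or centering are all illustrative assumptions.

```python
import numpy as np


def shrink(X, tau):
    """Soft-thresholding: move every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into low-rank L plus sparse S (principal component pursuit).

    Defaults for lam and mu follow common choices in the RPCA literature;
    they are assumptions here, not values taken from the abstract.
    """
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = m * n / (4.0 * np.abs(M).sum()) if mu is None else mu
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multipliers for the constraint M = L + S
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S


def lsa_components(L, k):
    """Truncated SVD of L: per-region scores and per-word loadings."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]


# Synthetic stand-in for a (regions x words) frequency matrix.
rng = np.random.default_rng(0)
n_regions, n_words = 60, 40
low_rank = rng.normal(size=(n_regions, 2)) @ rng.normal(size=(2, n_words))
outliers = np.zeros((n_regions, n_words))
mask = rng.random((n_regions, n_words)) < 0.05
outliers[mask] = rng.normal(scale=5.0, size=mask.sum())
M = low_rank + outliers

L, S = rpca(M)
scores, loadings = lsa_components(L, k=2)
```

In this sketch each row of `loadings` plays the role of one language pattern (a weight per word), and the matching column of `scores` gives its strength in each region, which is the quantity one would map and compare against census variables.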
