Knowing the Tweeters: Deriving Sociologically Relevant Demographics from Twitter

A perennial criticism regarding the use of social media in social science research is the lack of demographic information associated with naturally occurring mediated data such as that produced by Twitter. However, the fact that demographic information is not explicit does not mean that it is not implicitly present. Utilising the Cardiff Online Social Media ObServatory (COSMOS), this paper suggests various techniques for establishing or estimating demographic data from a sample of more than 113 million Twitter users collected during July 2012. We discuss in detail the methods that can be used for identifying gender and language, and illustrate that the proportion of males and females using Twitter in the UK reflects the gender balance observed in the 2011 Census. We also expand on the three types of geographical information that can be derived from Tweets, either directly or by proxy, and on how spatial information can be used to link social media with official curated data. Whilst we make no grand claims about the representative nature of Twitter users in relation to the wider UK population, the derivation of demographic data demonstrates the potential of new social media (NSM) for the social sciences. We consider this paper a clarion call and hope that other researchers test the methods we suggest and develop them further.
