On Utilizing Nonstandard Abbreviations and Lexicon to Infer Demographic Attributes of Twitter Users

Automatically and accurately determining the demographic attributes of writers from their texts can be useful in a range of application domains, including smart ad placement, security, the discovery of predatory behavior, and the automatic enhancement of participants’ profiles for extended analysis. It is also of interest to linguists who may wish to build on such inference for further sociolinguistic analysis. Previous work indicates that attributes such as author gender can be determined with some success using various methods, such as analysis of shallow linguistic patterns or topic in authors’ written texts. Author age appears more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g., over or under 30), ternary, or even continuous variable using various techniques. In this work, we show that word and phrase abbreviation patterns can be used toward determining user age using novel binning, as well as toward determining binary user gender and ternary user education level. Notable results include age classification accuracy of up to 83% (67% above the relative majority class baseline) using a support vector machine classifier and PCA-extracted features, including n-grams. User ages were classified into 10 equally sized age bins; classification achieved 51% accuracy (34% above baseline) when using only abbreviation features. Gender classification achieved 75% accuracy (13% above baseline) using only PCA-extracted abbreviation features, and education classification achieved 62% accuracy (19% above baseline) with PCA-extracted abbreviation features. Also presented is an analysis of the evident change in author abbreviation use over time on Twitter.
