Toward inferring the age of Twitter users with their use of nonstandard abbreviations and lexicon

Automatically determining the demographic attributes of writers from their texts, with high accuracy, is useful in a range of application domains, including smart ad placement, security, the discovery of predatory behavior, and the automatic enrichment of participants' profiles for extended analysis. Attributes such as author gender can be determined with some success from many sources, using methods such as analysis of shallow linguistic patterns or topic. Author age is more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g., over or under 30) or ternary variable, or even predicting it as a continuous variable, using various techniques. In this work, we show that word and phrase abbreviation patterns can be used toward determining user age using novel binning. Notable results include a classification accuracy of up to 82.8% (67.0% above the relative majority class baseline) when classifying user ages into 10 equally sized age bins using a support vector machine classifier and PCA-extracted features (including n-grams), and 50.8% (33.7% above baseline) when using only abbreviation features. Also presented is an analysis of the evident change in abbreviation use over time on Twitter.
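The binning and baseline terminology used above can be illustrated with a minimal sketch. It assumes that "equally sized" means bins holding roughly the same number of users (equal frequency), which the abstract does not state; the function names and the toy age list are hypothetical and are not taken from the paper.

```python
from collections import Counter


def equal_frequency_bins(ages, n_bins):
    """Assign each age a bin index in 0..n_bins-1 so that bins hold
    (nearly) the same number of users. Assumption: equal-frequency
    binning; users with identical ages may fall on either side of a
    bin boundary."""
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    labels = [0] * len(ages)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(ages)
    return labels


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent bin."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy data, invented for illustration only.
ages = [14, 15, 16, 17, 18, 19, 21, 23, 25, 28,
        31, 35, 40, 47, 55, 62, 18, 22, 30, 44]
labels = equal_frequency_bins(ages, n_bins=10)
baseline = majority_baseline(labels)
print(f"majority-class baseline: {baseline:.3f}")
```

A classifier's accuracy on the same bins can then be compared with `baseline` to obtain an "above baseline" figure of the kind reported above.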
