Learning Age and Gender of Blogger from Stylistic Variation

We report stylistic differences in blogging across gender and age groups. The results are based on two mutually independent features. The first is the use of slang words, a feature we propose as new to the stylistic study of bloggers. The second is the variation in average sentence length across age groups and genders. We augment these features with stylistic features from previous studies reported in the literature. The combined feature list considerably improves the accuracy of age and gender prediction. The machine learning experiments were run on two separate demographically tagged blog corpora. Gender determination is more accurate than age group detection when the data span all ages, but the accuracy of age prediction increases when we sample data from groups separated by a large age difference.
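As an illustration of the two features named above, the following minimal sketch computes a slang-word rate and an average sentence length for one blog post. It is not the authors' implementation: the slang lexicon `SLANG_WORDS`, the function name `stylistic_features`, and the simple regular-expression tokenization and sentence splitting are all assumptions made for this example.

```python
import re

# Hypothetical slang lexicon, for illustration only; the lexicon used in the
# study is not given in this abstract.
SLANG_WORDS = {"lol", "omg", "gonna", "wanna", "brb"}


def stylistic_features(text):
    """Return the slang rate and average sentence length (in words) of a text."""
    # Naive sentence split on runs of ., ! or ?; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())

    slang_count = sum(1 for t in tokens if t in SLANG_WORDS)
    slang_rate = slang_count / len(tokens) if tokens else 0.0
    avg_sentence_length = len(tokens) / len(sentences) if sentences else 0.0

    return {
        "slang_rate": slang_rate,
        "avg_sentence_length": avg_sentence_length,
    }


features = stylistic_features("omg I am gonna be late. See you soon!")
print(features)
```

Feature values like these, computed per post or per blogger, could then be combined with other stylistic features and passed to a classifier for age group or gender.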
