What Your Username Says About You

Usernames are ubiquitous on the Internet, and they often suggest user demographics. This work examines how well gender and language can be inferred from a username alone, using unsupervised morphology induction to decompose usernames into sub-units. Experimental results on both tasks show that the proposed morphological features are effective compared with a character n-gram baseline.
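
To make the character n-gram baseline concrete, here is a minimal sketch of how such features might be extracted from a single username. The function name `char_ngrams`, the lowercasing step, the `#` boundary marker, and the n-gram range are all illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(username, n_min=1, n_max=3, boundary="#"):
    """Count character n-grams of a username.

    Assumptions (not from the paper): the username is lowercased and
    padded with a boundary marker so that prefixes and suffixes yield
    distinct n-grams.
    """
    padded = boundary + username.lower() + boundary
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded string.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# Bigram and trigram counts for a hypothetical username.
features = char_ngrams("anna", n_min=2, n_max=3)
```

Counts like these could be fed to any standard text classifier. The morphological features described above would replace the fixed-width windows with sub-units learned by an unsupervised segmenter.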
