Using Content-Based Features for Author Profiling of Vietnamese Forum Posts

This paper reports the results of an author profiling task on Vietnamese forum posts, which aims to identify personal traits of the author, such as gender, age, occupation, and location, using content-based features. Experiments were conducted on data sets collected from various Vietnamese-language forums, and they compared different types of features: stylometric features (lexical, syntactic, and structural) and content-based features (the most important words). Three learning methods were tested, namely Decision Tree, Bayes Network, and Support Vector Machine (SVM), and the SVM achieved the best results. The results show that these kinds of features work well on short, free-style messages such as forum posts, and that content-based features yielded much better results than stylometric features.
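To make the distinction between the two feature families concrete, the following is a minimal Python sketch and not the authors' implementation. It rests on several assumptions: the function names (`tokenize`, `stylometric_features`, `build_vocabulary`, `content_features`) are hypothetical, the four stylometric measures are only illustrative examples of lexical and structural features, tokens are whitespace-delimited syllables rather than segmented Vietnamese words, and "most important words" is approximated by document frequency because the abstract does not say how the words were selected.

```python
import re
import string
import unicodedata
from collections import Counter


def tokenize(text):
    """Lowercased word tokens; NFC normalization keeps Vietnamese diacritics attached."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return re.findall(r"\w+", normalized)


def stylometric_features(post):
    """A few illustrative lexical and structural measures (assumed, not from the paper)."""
    words = tokenize(post)
    n_chars = max(len(post), 1)
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "punct_ratio": sum(ch in string.punctuation for ch in post) / n_chars,
        "n_lines": len(post.splitlines()),
    }


def build_vocabulary(posts, k):
    """Pick the k tokens that occur in the most posts (an assumed stand-in for
    'most important words'); ties are broken alphabetically for determinism."""
    doc_freq = Counter()
    for post in posts:
        doc_freq.update(set(tokenize(post)))
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def content_features(post, vocabulary):
    """Count how often each vocabulary word appears in the post."""
    counts = Counter(tokenize(post))
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    posts = [
        "hôm nay trời đẹp quá!",
        "trời mưa, hôm nay ở nhà",
        "trời đẹp",
    ]
    vocabulary = build_vocabulary(posts, k=3)
    for post in posts:
        print(stylometric_features(post), content_features(post, vocabulary))
```

Either feature vector, or their concatenation, could then be passed to a classifier such as an SVM. That training step is left out here because the abstract gives no detail about kernels or parameters.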
