Authorship Attribution and Gender Identification in Greek Blogs

The aim of this study is to perform authorship attribution and author gender identification on a corpus of blogs written in Modern Greek. The corpus contains 20 bloggers equally divided by gender (10 male and 10 female), with 50 blog posts from each author (1,000 posts in total, with an overall size of 406,460 words). From this corpus we calculated a number of standard stylometric variables (e.g. word length statistics and various vocabulary "richness" indices) and the 300 most frequent word and character n-grams (unigrams, bigrams and trigrams of each type). Support Vector Machines (SVM) were trained on these data. In a 10-fold cross-validation experiment, gender prediction accuracy reached 82.6%, a result comparable to current state-of-the-art author profiling systems. Authorship attribution accuracy reached 85.4%, an equally satisfying result given the large number of candidate authors (n=20).
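The feature extraction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tokenize`, `top_ngrams`, `stylometric_features`), the regular-expression tokenizer, and the choice of mean word length and type-token ratio as stand-ins for the "standard stylometric variables" are all assumptions made for illustration. The abstract does not specify which richness indices were used or how the n-gram frequencies were normalized.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumption: simple lowercase word tokens; \w matches Greek letters in Python 3.
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n):
    # All overlapping character sequences of length n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    # All overlapping word sequences of length n, joined by a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract(post, n, unit):
    # Dispatch on the n-gram type: "char" or "word".
    if unit == "char":
        return char_ngrams(post, n)
    return word_ngrams(tokenize(post), n)


def top_ngrams(posts, n, k, unit="char"):
    # The k most frequent n-grams over the whole corpus (k = 300 in the abstract).
    counts = Counter()
    for post in posts:
        counts.update(extract(post, n, unit))
    return [gram for gram, _ in counts.most_common(k)]


def stylometric_features(post, vocabulary, n, unit="char"):
    # One feature vector per post: two illustrative stylometric variables
    # followed by the relative frequency of each n-gram in `vocabulary`.
    tokens = tokenize(post)
    items = extract(post, n, unit)
    counts = Counter(items)
    total = max(len(items), 1)
    n_tokens = max(len(tokens), 1)
    mean_word_length = sum(len(t) for t in tokens) / n_tokens
    type_token_ratio = len(set(tokens)) / n_tokens  # a simple richness index
    return [mean_word_length, type_token_ratio] + [counts[g] / total for g in vocabulary]
```

Vectors built this way, one per post, would then be passed to an SVM classifier and evaluated with 10-fold cross-validation, using either the author or the author's gender as the class label.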
