TAT: An Author Profiling Tool with Application to Arabic Emails

This paper reports on the application of the Text Attribution Tool (TAT) to profiling the authors of Arabic emails. The TAT system was developed for language-independent author profiling and has now been trained on two email corpora, one English and one Arabic. We describe the overall TAT system and the machine learning experiments that produced classifiers for the different author traits. For both the English and the Arabic data, predictions of demographic and psychometric traits improve over the baseline for some of the traits. Arabic presents particular challenges for NLP, and this paper describes in more detail the text processing components developed to handle Arabic emails.
