Author Profiling for English and Arabic Emails

This paper reports on aspects of a research project aimed at automating the analysis of texts for author profiling and identification. The Text Attribution Tool (TAT) was developed for language-independent author profiling and has now been trained on two email corpora, one English and one Arabic. The complete analysis provides probabilities for the author’s basic demographic traits (gender, age, geographic origin, level of education and native language) as well as for five psychometric traits. The prototype system also provides a probability of a match with other texts, whether from known or unknown authors. Data collection was a major part of the project, and we give an overview of the collection process as well as a detailed description of the resulting corpus of email data. We describe the overall TAT system and its components before outlining how the email data is processed and analysed. Because Arabic presents particular challenges for NLP, this paper also describes in more detail the text processing components developed to handle Arabic emails. Finally, we describe the Machine Learning setup used to produce classifiers for the different author traits and present the experimental results, which are promising for most of the traits examined.
