Author gender identification from Arabic text

Abstract: The Gender Identification (GI) problem is concerned with determining the gender of a given text's author. It has a wide range of academic and commercial applications in fields such as literature, security, forensics, and electronic markets and trading. One approach to this problem rests on the hypothesis that the writing styles of authors of the same gender share certain aspects, which can be captured by stylometric features (SF). Another approach, known as Bag-Of-Words (BOW), focuses mainly on keyword occurrences in each document. In this work, we study and compare both approaches, focusing on the Arabic language, for which this problem remains largely understudied despite its importance. To the best of our knowledge, no previous work has considered these approaches for the GI problem of Arabic text. The comparison is carried out under different settings, and the results show that the SF approach, which is much cheaper to train, generates more accurate results under most settings. The best accuracy levels obtained by the SF and BOW approaches on our in-house dataset are 80.4% and 73.9%, respectively.
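
The difference between the two feature families can be illustrated with a minimal sketch. It is not the feature set or pipeline used in this work: the function names, the tokenization rule, and the four style markers below are assumptions chosen only for illustration. The BOW representation counts keyword occurrences, whereas the SF representation summarizes how the text is written, independently of the vocabulary.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: a token is a run of Unicode word characters.
    # Diacritized Arabic would need a proper normalizer or tokenizer.
    return re.findall(r"\w+", text)


def bow_features(text, vocabulary):
    """Bag-of-words: one count per vocabulary keyword (hypothetical helper)."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def stylometric_features(text):
    """A few illustrative style markers (not the paper's actual feature list)."""
    tokens = tokenize(text)
    # Split on Latin and Arabic sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    n_tokens = len(tokens)
    return {
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
        "avg_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(tokens)) / n_tokens if n_tokens else 0.0,
        # Count Latin and Arabic punctuation marks.
        "punctuation_count": sum(text.count(p) for p in ".,!?؟،؛"),
    }


if __name__ == "__main__":
    sample = "a bb. ccc bb!"
    print(bow_features(sample, ["bb", "zz"]))
    print(stylometric_features(sample))
```

Note that the BOW vector grows with the vocabulary, while the SF vector has a fixed, small number of dimensions. This is one plausible reason an SF-based classifier can be cheaper to train, although the sketch does not measure that.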
