An extensive study of the Bag-of-Words approach for gender identification of Arabic articles

The prevalent use of Online Social Networks (OSN), together with the anonymity and lack of accountability they inherit from being online, gives rise to many problems related to connecting the massive amount of text data on OSN with the people who actually wrote it. Analyzing text data for such purposes is called authorship analysis. This work focuses on one specific type of authorship analysis: identifying the author's gender. Gender identification has various applications, from marketing to security. The focus of this work is on Arabic articles. The problem is essentially a classification problem, and current approaches differ in the way they compute the features of each document. However, they all agree on following some “stylometric features” approach. Unlike these works, ours treats the problem as a variation of the Text Classification (TC) problem and follows the Bag-Of-Words (BOW) approach for feature selection. We perform an extensive set of experiments on the feature selection and classification phases, and the results show that such an approach yields surprisingly high results.
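
To make the BOW idea concrete, the following is a minimal sketch, not the pipeline used in this work. It assumes whitespace tokenization, placeholder tokens (`t1` ... `t6`), hypothetical class labels (`class_a`, `class_b`), and a multinomial Naive Bayes classifier with add-one smoothing chosen purely for illustration. Real Arabic text would likely need normalization and stemming before counting.

```python
from collections import Counter
import math


def tokenize(text):
    # Assumption: plain whitespace tokenization, no Arabic-specific preprocessing.
    return text.split()


def build_vocabulary(docs):
    # Map every distinct training token to a fixed column index.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(doc, vocab):
    # Represent a document as term counts; word order is discarded
    # and tokens outside the vocabulary are ignored.
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]


def train_nb(vectors, labels):
    # Multinomial Naive Bayes with add-one (Laplace) smoothing.
    n_terms = len(vectors[0])
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + n_terms
        log_like[c] = [math.log((t + 1) / denom) for t in totals]
    return log_prior, log_like


def predict_nb(model, vector):
    # Pick the class with the highest log posterior score.
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(vector, log_like[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy corpus with placeholder tokens and hypothetical labels.
train_docs = ["t1 t2 t2 t3", "t2 t3 t3", "t4 t5 t5", "t5 t6 t4"]
train_labels = ["class_a", "class_a", "class_b", "class_b"]

vocab = build_vocabulary(train_docs)
model = train_nb([bow_vector(d, vocab) for d in train_docs], train_labels)

print(bow_vector("t2 t2 t7", vocab))
print(predict_nb(model, bow_vector("t2 t3", vocab)))
```

The point of the sketch is that each document is reduced to a vector of term counts, so any standard TC classifier can be applied without hand-crafted stylometric features.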
