A Comparative Survey of Authorship Attribution on Short Arabic Texts

This paper addresses the problem of authorship attribution (AA) on short Arabic texts. We survey several features and classifiers commonly employed for AA. The investigation covers seven feature types: characters, character bigrams, character trigrams, character tetragrams, words, word bigrams and rare words. Attribution is performed with 4 different measures, 3 classifiers (Multi-Layer Perceptron (MLP), Support Vector Machines (SVM) and Linear Regression (LR)) and a newly proposed fusion scheme called VBF (Vote Based Fusion). The evaluation is carried out on short Arabic texts extracted from the AAAT dataset (AA of Ancient Arabic Texts). Although AA is known to be difficult on short texts, the results reveal useful information about the performance of the features and classification techniques on Arabic text data. For instance, character-based features appear to outperform word-based features on short texts. Furthermore, the proposed VBF fusion reached an attribution accuracy of 90%, which is higher than the score of the original classifier using only one feature. Overall, the results shed light on the efficiency and relevance of several features and classifiers for AA of short Arabic texts.
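To make the character n-gram features and the vote-based fusion idea concrete, the following is a minimal Python sketch. The abstract does not specify how VBF combines the individual decisions, so the plurality vote with a first-listed tie-break used here is an assumption, and the function names `char_ngrams` and `majority_vote` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the overlapping character n-grams of a text.

    Works on Unicode code points, so Arabic strings are handled
    the same way as Latin ones.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def majority_vote(predictions):
    """Fuse author labels predicted from different features.

    Assumption: a plain plurality vote. On a tie, the label that
    appears first in `predictions` wins. This is an illustrative
    rule, not necessarily the VBF rule used in the paper.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Character bigrams of a short Arabic word (4 letters give 3 bigrams).
bigrams = char_ngrams("كتاب", 2)

# Hypothetical decisions of one classifier run on three feature types.
fused = majority_vote(["author_3", "author_7", "author_3"])
```

In this sketch each feature type (for example character bigrams, trigrams and tetragrams) would yield one predicted author for a text, and the fused decision is the label predicted most often.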
