Authorship Attribution of Short Historical Arabic Texts Based on Lexical Features

In this paper the authors investigate the authorship of several short historical texts written by ten ancient Arabic travelers. This Arabic dataset, collected by the authors in 2011, is called the AAAT dataset. The texts are characterized with different lexical features: words, word bigrams, word trigrams, word tetragrams and rare words. Seven classifiers are employed, namely Manhattan distance, Cosine distance, Stamatatos distance, Canberra distance, Multi-Layer Perceptron (MLP), Sequential Minimal Optimization based Support Vector Machine (SMO-SVM) and Linear Regression. For the evaluation, several authorship attribution experiments are conducted on the AAAT dataset, combining these features and classifiers. Results show good attribution performance, with a best score of 80% correct authorship attribution. Moreover, this investigation has revealed interesting results concerning the Arabic language, and more particularly short Arabic texts.
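As a rough illustration of the distance-based part of such a pipeline, the following minimal sketch builds word n-gram frequency profiles and attributes an unknown text to the author with the nearest profile. It is not the authors' implementation: the whitespace tokenization, the relative-frequency normalization, the function names (`word_ngrams`, `profile`, `attribute`) and the toy English texts are all assumptions made for the example. Only three of the listed distances (Manhattan, Cosine, Canberra) are shown.

```python
from collections import Counter
import math


def word_ngrams(text, n):
    """Return the word n-grams of a text (assumes whitespace tokenization)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def profile(text, n=1):
    """Map each word n-gram to its relative frequency in the text."""
    counts = Counter(word_ngrams(text, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def manhattan(p, q):
    """Sum of absolute frequency differences over the union of features."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def cosine_distance(p, q):
    """One minus the cosine similarity of the two frequency vectors."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return 1.0 - dot / norm if norm else 1.0


def canberra(p, q):
    """Sum of |a - b| / (a + b) over features present in either profile."""
    total = 0.0
    for k in set(p) | set(q):
        a, b = p.get(k, 0.0), q.get(k, 0.0)
        if a + b > 0:
            total += abs(a - b) / (a + b)
    return total


def attribute(unknown_text, reference_texts, distance, n=1):
    """Return the author whose reference profile is closest to the unknown text."""
    unknown = profile(unknown_text, n)
    return min(
        reference_texts,
        key=lambda author: distance(unknown, profile(reference_texts[author], n)),
    )


# Toy placeholder data (hypothetical, not taken from the AAAT dataset).
references = {
    "author_a": "the ship sailed east across the sea",
    "author_b": "we walked through the desert for days",
}
unknown = "the ship crossed the sea"
predicted = attribute(unknown, references, manhattan, n=1)
```

Passing `n=2`, `n=3` or `n=4` switches the same code to word bigrams, trigrams or tetragrams. On very short texts the higher-order profiles become sparse, which is one reason comparing several feature types on short texts is informative.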
