Stylometry raises two important technical questions: first, does text size affect authorship attribution performance? Second, what effect does the language have on that attribution? To address these questions, we conducted several authorship attribution experiments on text documents of varying size, from 100 to 3000 words per document. For that purpose, a dedicated Arabic dataset, the A4P corpus, was built. The corpus is made available to the scientific community and is suited to stylometry because the genre and theme of its texts are quite similar. Two types of features are investigated, character n-grams and words, in association with several classifiers: SVM, MLP, linear regression, Stamatatos distance and Manhattan distance. Two scores are proposed for the experiments: the “Score of Good Attribution” and the “Robustness against Size Reduction” ratio. The results show that the minimum text size required for a fair authorship attribution depends on the features and the classification method employed. For evaluation, authorship attribution was applied to 7 religious books, with the main purpose of checking whether the Quran and the Hadith could have the same author. The results clearly indicate that those two books should have two different authors.
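As an illustration of one feature and classifier pairing named above (character n-grams with Manhattan distance), the following is a minimal sketch, not the implementation used in the experiments. The function names, the choice of relative-frequency profiles, the default n = 3 and the word-level truncation used to simulate smaller text sizes are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Optional


def char_ngram_profile(text: str, n: int = 3) -> Dict[str, float]:
    """Relative frequencies of overlapping character n-grams (assumed profile form)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def manhattan(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Manhattan (L1) distance between two sparse frequency profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def truncate_words(text: str, max_words: int) -> str:
    """Keep only the first max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def attribute(unknown: str,
              references: Dict[str, str],
              n: int = 3,
              max_words: Optional[int] = None) -> str:
    """Return the candidate author whose reference profile is nearest in L1 distance.

    If max_words is given, the unknown text is truncated first, which
    imitates testing on a smaller document (hypothetical protocol).
    """
    if max_words is not None:
        unknown = truncate_words(unknown, max_words)
    target = char_ngram_profile(unknown, n)
    return min(
        references,
        key=lambda author: manhattan(target, char_ngram_profile(references[author], n)),
    )
```

Calling `attribute(text, references, max_words=100)` and then repeating with larger values of `max_words` gives a rough way to observe how the decision changes with text size under these assumptions.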