Text Classification for Authorship Attribution Using Naive Bayes Classifier with Limited Training Data

Authorship attribution (AA) is the task of identifying the authors of disputed or anonymous texts. It can be seen as a single-label, multi-class text classification task that is concerned with writing style rather than topic. The scalability issue in traditional AA studies concerns the effect of data size, that is, the amount of data available per candidate author. This issue has not yet been probed in much depth, since most stylometry research tends to focus on long texts or on multiple short texts per author, because stylistic choices occur less frequently in short texts. This paper investigates authorship attribution on short historical Arabic texts written by 10 different authors. Several experiments are conducted on these texts by extracting various lexical and character features of each author's writing style, using word-level n-grams (n = 1, 2, 3, and 4) and character-level n-grams (n = 1, 2, 3, and 4) as the text representation. A Naive Bayes (NB) classifier is then employed to assign the texts to their authors. The aim is to show the robustness of the NB classifier for AA on very short texts when compared with Support Vector Machines (SVMs). Using a dataset (called AAAT) that consists of 3 short texts per author's book, it is shown that our method is at least as effective as Information Gain (IG) for the selection of the most significant n-grams. Moreover, the significance of punctuation marks for distinguishing between authors is explored, showing that an increase in performance can be achieved. The NB classifier also achieved high accuracy: the experiments on the AAAT dataset yield a best classification accuracy of up to 96%, obtained with word-level 1-grams.

Keywords: Authorship attribution, Text classification, Naive Bayes classifier, Character n-gram features, Word n-gram features.
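The pipeline described in the abstract (n-gram extraction followed by Naive Bayes classification) can be illustrated with a minimal sketch. It is not the paper's implementation: the multinomial model with add-one (Laplace) smoothing, the whitespace tokenizer, the names `word_ngrams`, `char_ngrams`, and `NaiveBayesAttributor`, and the toy English sentences standing in for the AAAT texts are all assumptions made for illustration. Punctuation is left attached to tokens, so it contributes to the features.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Word-level n-grams from whitespace tokens (punctuation stays attached)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character-level n-grams, including spaces and punctuation marks."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesAttributor:
    """Multinomial Naive Bayes over n-gram counts (hypothetical helper class)."""

    def __init__(self, extract, alpha=1.0):
        self.extract = extract            # function: text -> list of n-grams
        self.alpha = alpha                # smoothing constant (assumed value)
        self.counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            feats = self.extract(text)
            self.counts[author].update(feats)
            self.doc_counts[author] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        # N-grams never seen in training are ignored at prediction time.
        feats = [f for f in self.extract(text) if f in self.vocab]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_author, best_score = None, -math.inf
        for author, counter in self.counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[author] / total_docs)
            denom = sum(counter.values()) + self.alpha * vocab_size
            for f in feats:
                score += math.log((counter[f] + self.alpha) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Toy training data (invented; NOT taken from the AAAT dataset).
train_texts = [
    "the sea was calm and the wind was soft",
    "the sea was wide and the road was long",
    "we marched; we halted; we marched again",
    "we halted; we rested; we marched on",
]
train_authors = ["A", "A", "B", "B"]

word_clf = NaiveBayesAttributor(lambda t: word_ngrams(t, 1)).fit(train_texts, train_authors)
char_clf = NaiveBayesAttributor(lambda t: char_ngrams(t, 2)).fit(train_texts, train_authors)

print(word_clf.predict("the wind was calm"))
print(char_clf.predict("we halted again"))
```

Switching between word-level and character-level features, or between n = 1 and n = 4, only requires passing a different extractor function, which mirrors the way the experiments vary the text representation while keeping the classifier fixed.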
