An Investigation of Supervised Learning Methods for Authorship Attribution in Short Hinglish Texts using Char & Word N-grams

The writing style of a person can serve as a unique identity indicator: the words used and the structure of the sentences are measurable traits that can identify the author of a specific work. Stylometry and its subfield, authorship attribution, have a long history dating back to the 19th century and remain in use today. The emergence of the Internet has shifted attribution studies towards non-standard texts that are shorter than, and different from, the long texts on which most research has been done. This paper studies short online texts retrieved from the messaging application WhatsApp, examining the distinctive features of a macaronic language (Hinglish) using supervised learning methods and comparing the resulting models. Word n-gram and character n-gram features are compared across four classifiers, namely the Naive Bayes classifier, Support Vector Machine (SVM), Conditional Tree, and Random Forest, to find the best discriminator for such corpora. Our results show that SVM attained a test accuracy of up to 95.079%, while Naive Bayes attained up to 94.455% on the dataset. Conditional Tree and Random Forest did not perform as well as expected. We also found that word unigram and character 3-gram features were more likely to distinguish authors accurately than other features.
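To make the feature types concrete, the following is a minimal sketch of word n-gram and character n-gram extraction feeding a multinomial Naive Bayes classifier with add-one smoothing. It is not the authors' implementation: the function and class names (`char_ngrams`, `word_ngrams`, `NaiveBayes`) and the four toy messages are assumptions made up for illustration, and the toy data is not drawn from the WhatsApp corpus described above.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=1):
    """Return word n-grams built from lowercased whitespace tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, featurize):
        self.featurize = featurize          # text -> list of features
        self.counts = defaultdict(Counter)  # author -> feature counts
        self.docs = Counter()               # author -> number of messages
        self.vocab = set()                  # all features seen in training

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            feats = self.featurize(text)
            self.counts[author].update(feats)
            self.docs[author] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_author, best_score = None, -math.inf
        for author in self.docs:
            total = sum(self.counts[author].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.docs[author] / total_docs)
            for feat in self.featurize(text):
                score += math.log(
                    (self.counts[author][feat] + 1) / (total + vocab_size)
                )
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Hypothetical toy messages, invented for this sketch only.
train_texts = [
    "kal milte hain yaar",
    "yaar kya scene hai",
    "Please send the report asap",
    "send me the file please",
]
train_authors = ["A", "A", "B", "B"]

word_model = NaiveBayes(lambda t: word_ngrams(t, 1)).fit(train_texts, train_authors)
char_model = NaiveBayes(lambda t: char_ngrams(t, 3)).fit(train_texts, train_authors)

if __name__ == "__main__":
    for msg in ["kya haal hai yaar", "please send report"]:
        print(msg, "->", word_model.predict(msg), char_model.predict(msg))
```

Character n-grams capture spelling habits and transliteration choices that vary between Hinglish writers, while word unigrams capture vocabulary preference; swapping the `featurize` argument is enough to compare the two on the same classifier.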
