Author Identification using Sequential Minimal Optimization with rule-based Decision Tree on Indian Literature in Marathi

Abstract Authorship identification is the task of determining who wrote a given piece of text from a given set of candidate authors (suspects). The rapidly growing volume of text on the Internet makes reliable authorship identification increasingly urgent. A large amount of work has already been done for the English language. Comparatively less research has been carried out for Indian regional languages such as Tamil, Telugu, Bengali and Punjabi, and no such experiment is available for Marathi. This study presents a strategy for authorship identification of documents written in Marathi. We adopted a set of fine-grained lexical and stylistic features for the analysis of the text and used them to develop two different models: a statistical similarity model and SMORDT (Sequential Minimal Optimization with Rule-based Decision Tree). We then validated the feature extraction method to show that the features remain consistently significant in every model used in this experiment. The performance of the proposed approach was evaluated in terms of Recall, Precision, F-measure and Accuracy.
