Designing Punjabi poetry classifiers using machine learning and different textual features

Analysis of poetic text is very challenging from a computational linguistics perspective, and literary arts such as poetry are particularly difficult to classify automatically. For a library recommendation system, poems can be classified by criteria such as poet, time period, sentiment and subject matter. In this work, a content-based Punjabi poetry classifier was developed using the Weka toolset. Four categories were manually populated with 2034 poems: Nature and Festival (NAFE), Linguistic and Patriotic (LIPA), Relation and Romantic (RORE), and Philosophy and Spiritual (PHSP), containing 505, 399, 529 and 601 poems, respectively. The poems were passed through several pre-processing sub-phases: tokenization, noise removal, stop word removal and special symbol removal. The 31938 extracted tokens were weighted using the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) schemes. Based on the elements of poetry, three kinds of textual features (lexical, syntactic and semantic) were used to build classifiers with four machine learning algorithms: Naive Bayes (NB), Support Vector Machine (SVM), HyperPipes and K-nearest neighbour. The results show that semantic features performed better than lexical and syntactic features. The best-performing algorithm was SVM, and the highest accuracy (76.02%) was achieved by incorporating semantic information associated with words.
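The pre-processing and weighting steps described above can be illustrated with a minimal sketch. It is not the Weka pipeline used in this work: it assumes the common textbook weighting tf x ln(N / df), a simple whitespace tokenizer, and a toy English corpus with a hypothetical stop-word list standing in for the Punjabi poems. The function names `tokenize` and `tf_idf` are illustrative only.

```python
import math
from collections import Counter


def tokenize(text, stop_words):
    """Split on whitespace, strip surrounding symbols, drop stop words."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()[]{}|-")  # special symbol removal
        if token and token not in stop_words:   # stop word removal
            tokens.append(token)
    return tokens


def tf_idf(documents, stop_words=frozenset()):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term frequency times ln(N / document frequency).
    """
    tokenized = [tokenize(doc, stop_words) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        )
    return weights


# Toy placeholder corpus (not taken from the paper's data set).
docs = ["spring flowers bloom", "the motherland calls", "spring of the motherland"]
stop = {"the", "of"}
weights = tf_idf(docs, stop)
```

A term that occurs in every document receives weight zero under this scheme, which is why TF-IDF tends to favour words that discriminate between categories over words that are merely frequent.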
