Malayalam POS Tagger - A Comparison Using SVM and HMM

Many Part-of-Speech (POS) taggers for the Malayalam language have been implemented using Support Vector Machine (SVM), Memory-Based Language Processing (MBLP), Hidden Markov Model (HMM), and other similar techniques. The objective of this work was to find an improved POS tagger for Malayalam, and to that end it compares Malayalam POS taggers built with SVM and HMM. The tag set used was the popular Bureau of Indian Standards (BIS) tag set. A manually created data set of around 52,000 words was taken from various Malayalam news sites. The preprocessing steps applied to the news text are also described. POS tagging was then performed using SVM and HMM. Because POS tagging requires assigning one of multiple class labels, a multi-class SVM was used. The SVM approach also performs feature extraction, feature selection, and classification. Word sense disambiguation and misclassification of words were the two major issues identified with SVM. The HMM predicts the hidden tag sequence based on maximum observation likelihood, which reduces ambiguity and the misclassification rate.
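To illustrate how an HMM can pick a tag for an ambiguous word from its context, the following is a minimal sketch of Viterbi decoding, the standard way to recover the most likely hidden tag sequence from an HMM. The abstract does not state which decoder the authors used, so this is an assumption. The tokens (`tok_a`, `tok_b`, `tok_c`) and all probabilities are hypothetical toy values, not figures from the 52,000-word corpus; only the tag names `N_NN` and `V_VM` follow the BIS tag set.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely tag sequence for the observed tokens."""
    def lp(p):
        # Log-probability with a small floor so unseen events do not yield log(0).
        return math.log(max(p, floor))

    # best[t][s]: log-score of the best path that ends in state s at position t.
    best = [{s: lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lp(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lp(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: two BIS tags and three placeholder tokens.
states = ["N_NN", "V_VM"]
start_p = {"N_NN": 0.8, "V_VM": 0.2}
trans_p = {
    "N_NN": {"N_NN": 0.3, "V_VM": 0.7},
    "V_VM": {"N_NN": 0.6, "V_VM": 0.4},
}
emit_p = {
    "N_NN": {"tok_a": 0.6, "tok_b": 0.1, "tok_c": 0.3},
    "V_VM": {"tok_a": 0.1, "tok_b": 0.7, "tok_c": 0.2},
}

# tok_c can be emitted by both tags; its decoded tag depends on the context.
alone = viterbi(["tok_c"], states, start_p, trans_p, emit_p)
in_context = viterbi(["tok_a", "tok_c"], states, start_p, trans_p, emit_p)
print(alone, in_context)
```

In this toy model the ambiguous token `tok_c` is tagged as a noun on its own but as a verb when it follows a noun, because the noun-to-verb transition probability outweighs its slightly higher noun emission probability. In a real tagger the start, transition, and emission tables would be estimated from the annotated corpus rather than written by hand.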
