Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information Without Extensive Linguistic Knowledge

Part-of-speech (POS) tagging for Indian languages in general, and Hindi in particular, is not a widely explored territory. There have been many attempts at developing a good POS tagger for Hindi, but the morphological complexity of the language makes it a hard problem. Some of the best taggers available for Indian languages employ hybrids of machine learning or stochastic methods and linguistic knowledge. Though the results achieved using such methods are good, their practicability for other inflective Indian languages is reduced by their heavy dependence on linguistic knowledge. Even though taggers can achieve very good results when provided with good morphological information, the cost of creating these resources renders such methods impractical. In this paper, we present a simple HMM-based POS tagger that employs a naive (longest suffix matching) stemmer as a pre-processor and achieves a reasonably good accuracy of 93.12%. This method does not require any linguistic resource apart from a list of possible suffixes for the language, and this list can be easily created using existing machine learning techniques. The aim of this method is to demonstrate that, even without tools such as a morphological analyzer or resources such as a pre-compiled structured lexicon, it is possible to harness the morphological richness of Indian languages.
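
The longest suffix matching idea mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `naive_stem`, the `min_stem_len` parameter, and the romanized suffix list are hypothetical choices made for illustration, and the abstract does not specify how the stem and suffix are passed on to the HMM.

```python
def naive_stem(word, suffixes, min_stem_len=1):
    """Split `word` into (stem, suffix) using the longest matching suffix.

    `suffixes` is a plain list of candidate suffix strings, the only
    language-specific resource this sketch assumes. If no suffix matches,
    the word is returned unchanged with an empty suffix.
    """
    # Try longer suffixes first so the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        stem_len = len(word) - len(suffix)
        if suffix and word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len], suffix
    return word, ""


# Hypothetical romanized suffix list, for illustration only.
SUFFIXES = ["on", "iyon", "e"]

print(naive_stem("ladkiyon", SUFFIXES))  # ('ladk', 'iyon')
print(naive_stem("ghar", SUFFIXES))      # ('ghar', '')
```

In the first call both "on" and "iyon" match the end of the word, and the longer one is chosen. The resulting stem and suffix could then be supplied to a tagger in place of, or alongside, the surface form.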
