Comparison of different POS tagging techniques for some South Asian languages

Several approaches exist for assigning a part-of-speech (POS) tag to each word of a natural language sentence. We compare POS tagging approaches for Bangla and two other South Asian languages, and report the baseline performance of the same techniques for English. The most widely used methods for English are statistical methods, such as n-gram based or Hidden Markov Model (HMM) based tagging, and rule-based or transformation-based methods, such as Brill's tagger. Subsequent research has added various modifications to these basic approaches to improve tagger performance for English. Here, we present a detailed review of previous work in the area, with a focus on South Asian languages such as Hindi and Bangla. We experiment with Brill's transformation-based tagger and a supervised HMM-based tagger, both without the modifications that improve accuracy, on English using training corpora of different sizes drawn from the Brown corpus. We also compare the performance of these taggers on three South Asian languages, with a focus on Bangla, using two different tagsets and corpora of different sizes; the results indicate that Brill's transformation-based tagger performs quite well for South Asian languages. Finally, we use the English baselines to estimate how these approaches might perform given a considerable amount of annotated training data.
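To make the supervised HMM approach concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is an illustration under stated assumptions, not the tagger used in the experiments: the toy corpus, the tag names, the add-one smoothing, and the function names `train_hmm`, `log_prob`, and `viterbi_tag` are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
import math

START = "<s>"  # hypothetical sentence-start marker


def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (an assumption of this sketch)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))


def viterbi_tag(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words

    # best[i][t] = (log score of best path ending in tag t at position i, previous tag)
    best = [{
        t: (log_prob(transitions[START], t, n_tags)
            + log_prob(emissions[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
        best.append(column)

    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy training data, invented for illustration only.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

transitions, emissions, vocab = train_hmm(TRAIN)
print(viterbi_tag(["the", "cat", "barks"], transitions, emissions, vocab))
```

A transformation-based tagger in the style of Brill's would instead start from a simple initial tagging (for example, the most frequent tag per word) and learn an ordered list of correction rules from the training data.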
