A New Approach to Tagging in Indian Languages

In this paper, we present a new approach to automatic tagging that requires neither a machine learning algorithm nor training data. We argue that the critical information required for tagging comes more from word-internal structure than from context, and we show how a well-designed morphological analyzer can assign correct tags and also disambiguate many cases of tag ambiguity. The crux of the approach lies in the very definition of words. While others simply tokenize a given sentence based on spaces and take these tokens to be words, we argue that words need to be motivated by semantic and syntactic considerations, not orthographic conventions. We have worked on Telugu and Kannada; in this paper, we take Telugu as the example and show how high-quality tagging can be achieved with a fine-grained, hierarchical tag set that carries not only morpho-syntactic information but also some aspects of lexical and semantic information that are necessary or useful for syntactic parsing. In fact, entire corpora can be tagged very fast and with a good degree of guarantee of quality. We give details of our experiments and the results obtained. We believe our approach can also be applied to other languages.
