Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources

Indian languages have large speaker bases, yet some of them have minimal or inadequate linguistic resources. For example, Kannada is relatively resource-poor compared to Malayalam, Tamil and Telugu, which in turn are relatively resource-poor compared to Hindi. Many Indian language pairs exhibit high similarity in morphology and syntactic behaviour; for example, Kannada is highly similar to Telugu. In this paper, we show how to build a cross-language part-of-speech tagger for Kannada by exploiting the resources of Telugu. We also build large corpora and a morphological analyser (including lemmatisation) for Kannada. Our experiments reveal that cross-language taggers are as efficient as monolingual taggers. Our tools are efficient and significantly faster than the existing monolingual tools. We aim to extend our work to other Indian languages.
