Morphological Analyzer and Generator for Tamil Language

Morphological analysis is an essential component of Natural Language Processing (NLP) applications ranging from spell checkers to machine translation. It segments a word into its morphemes and analyzes how those morphemes attach to one another. Word formation in English is considerably less complex than in Indic languages, so Tamil poses its own challenges when building an NLP application. The morphemes of the language, the rules by which they combine, and the changes that occur when they attach are the key factors to consider when building a morphological analyzer for any language. Our “Morphological Analyzer and Generator for Tamil Language” generates the word forms of a stem or root for a given context and, conversely, analyzes a Tamil surface form into its proper context. The model covers only the nouns and verbs of Tamil. This paper illustrates how the lexicon and the orthographic rules of Tamil have been written as regular expressions using only finite-state operations, and how this approach has been implemented to develop a morphological analyzer/generator. The model is built with the Xerox toolkit, which uses “Two-level Morphology”; almost 2000 noun stems and 96 verb stems have been incorporated into the network. A noun stem currently produces about 40 different forms, and a verb stem up to 240. We have also defined our own transliteration scheme for this purpose.
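To make the lexicon-plus-rules idea concrete, the following is a minimal sketch in plain Python rather than in the Xerox toolkit. It is not the authors' implementation: the two stems, the three suffixes, the tag names, the ad hoc romanization and the two rewrite rules are all illustrative assumptions, and analysis is done by enumerating generated forms instead of running a compiled transducer in reverse.

```python
import re

# Hypothetical toy lexicon in an ad hoc romanization
# (NOT the transliteration scheme defined in the paper).
NOUN_STEMS = ["maram", "paDam"]  # 'tree', 'picture'
NOUN_SUFFIXES = {
    "+Sg+Nom": "",     # bare stem
    "+Pl+Nom": "kaL",  # plural marker
    "+Sg+Acc": "ai",   # accusative marker
}

# Orthographic rules as ordered regex rewrites over "stem^suffix",
# where ^ marks the morpheme boundary. Illustrative only.
RULES = [
    (re.compile(r"m\^(?=k)"), "ng^"),            # stem-final m -> ng before k
    (re.compile(r"am\^(?=[aeiou])"), "athth^"),  # -am -> -athth before a vowel suffix
    (re.compile(r"\^"), ""),                     # finally erase the boundary
]


def generate(stem, tags):
    """Map a lexical form (stem + tags) to its surface form."""
    form = stem + "^" + NOUN_SUFFIXES[tags]
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def build_analyzer():
    """Invert generation by tabulating every surface form in the toy lexicon."""
    table = {}
    for stem in NOUN_STEMS:
        for tags in NOUN_SUFFIXES:
            table.setdefault(generate(stem, tags), []).append(stem + tags)
    return table


ANALYSES = build_analyzer()


def analyze(surface):
    """Return all lexical analyses of a surface form (empty list if unknown)."""
    return ANALYSES.get(surface, [])


if __name__ == "__main__":
    print(generate("maram", "+Pl+Nom"))
    print(analyze("paDaththai"))
```

Enumeration is only workable because the toy lexicon is tiny. A finite-state toolkit avoids it by composing the lexicon and the rules into a single transducer that can be applied in either direction, which is what lets one description serve as both analyzer and generator.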