A Statistical Based Part of Speech Tagger for Urdu Language

In this paper we present a pioneering step in designing an n-gram-based part-of-speech tagger for the Urdu language. In the last few years, part-of-speech tagging work has been carried out for English, South Asian, and European languages. Our focus here is the disambiguation problem: assigning the correct tag to every word from its set of possible tags. Our approach employs an n-gram Markov model that is trained on an annotated Urdu corpus and then assigns tags to text. The proposed n-gram part-of-speech tagger was tested and achieved state-of-the-art performance of 95.0%. Furthermore, we compare our experimental results on two types of tagset. Along the way, we apply an evaluation method that shows how significant our experimental results are. In addition, we present an error analysis (confusion matrix) and a worked example of Urdu tagging. We also give an overview of the Urdu language. The contribution of our work is an initial step toward a statistical part-of-speech tagger for Urdu.
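To make the idea of an n-gram Markov model tagger concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a bigram model (n = 2), add-one smoothing, and Viterbi decoding, none of which are specified above. The names `train_bigram_tagger`, `viterbi_tag`, and `TOY_CORPUS`, the romanized toy sentences, and the tag labels `NN`, `VB`, and `AUX` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

# Invented toy data (romanized), NOT taken from the annotated corpus.
TOY_CORPUS = [
    [("larka", "NN"), ("kitab", "NN"), ("parhta", "VB"), ("hai", "AUX")],
    [("larki", "NN"), ("khat", "NN"), ("likhti", "VB"), ("hai", "AUX")],
]

def train_bigram_tagger(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> next-tag counts
    emissions = defaultdict(Counter)    # tag -> word counts
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (smoothing is an assumption here)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))

def viterbi_tag(words, transitions, emissions, vocab):
    """Return the most probable tag sequence under the bigram model."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][t] = (log score of best path ending in tag t, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(transitions[START], t, n_tags)
                 + log_prob(emissions[t], words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

transitions, emissions, vocab = train_bigram_tagger(TOY_CORPUS)
print(viterbi_tag(["larka", "khat", "parhta", "hai"], transitions, emissions, vocab))
```

On this toy data the decoder tags an unseen recombination of the training words as noun, noun, verb, auxiliary. A larger n or a different smoothing scheme would change only the probability estimates in `log_prob` and the state space in `viterbi_tag`.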