Paper information - A Tagged Corpus and a Tagger for Urdu

A Tagged Corpus and a Tagger for Urdu

In this paper, we describe the release of a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release the tagged corpus. Additionally, we use this data to train a single standalone tagger, which we hope will significantly simplify Urdu processing. The standalone tagger obtains an accuracy of 88.74% on test data.
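
The voting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the scheme of Jawaid and Bojar (2012), whose details are not given here; it is a plain per-token majority vote over the outputs of several taggers. The function name `vote_tags`, the `fallback` tie-break rule, and the tag labels in the usage example are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def vote_tags(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token tags from several taggers by majority vote.

    predictions[i][j] is the tag that tagger i assigns to token j.
    When no tag wins a strict majority, the tag from the tagger at index
    `fallback` is used (an assumed tie-break rule, not taken from the paper).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    voted = []
    for token_tags in zip(*predictions):
        # Most frequent tag for this token and how many taggers chose it.
        tag, count = Counter(token_tags).most_common(1)[0]
        if count * 2 > len(token_tags):
            voted.append(tag)
        else:
            # No strict majority: defer to the designated fallback tagger.
            voted.append(token_tags[fallback])
    return voted


# Hypothetical outputs of three taggers for a three-token sentence.
tagger_a = ["NN", "VB", "JJ"]
tagger_b = ["NN", "NN", "ADV"]
tagger_c = ["PN", "VB", "PP"]
combined = vote_tags([tagger_a, tagger_b, tagger_c])
```

With three taggers, a strict majority means at least two agree; when all three disagree, the sketch falls back to a single designated tagger, which in practice would plausibly be the one with the best standalone accuracy.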

Ondrej Bojar | Bushra Jawaid | Amir Kamran

[1] Lluís Màrquez i Villodre, et al. SVMTool: A general POS Tagger Generator Based on Support Vector Machines, 2004, LREC.

[2] Tony McEnery, et al. EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation, 2002, LREC.

[3] Sarmad Hussain, et al. Study of Noun Phrase in Urdu, 2016.

[4] Hassan Sajjad, et al. Tagging Urdu Text with Parts of Speech: A Tagger Comparison, 2009, EACL.

[5] Andrew Hardie, et al. Developing a tagset for automated part-of-speech tagging in Urdu, 2003.

[6] Ondrej Bojar, et al. Tagger Voting for Urdu, 2012, WSSANLP@COLING.

[7] Harald Hammarström, et al. Urdu Morphology, Orthography and Lexicon Extraction, 2007.