Paper information - Developing a tagset for automated part-of-speech tagging in Urdu.

While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has hitherto been done in the area of tagset creation for Urdu. The tagset discussed here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Although these guidelines were written to cover the languages of the European Union, they can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo-European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies presented by Urdu grammar. This paper will look at the process of creating one of the necessary resources for the development of a POS tagging system for Urdu, that of a suitable tagset, considering some of the problems encountered along the way.

Andrew Hardie | A. Hardie
