FarsiTag: A part-of-speech tagging system for Persian

FarsiTag is a tagging system that assigns the most probable part-of-speech (POS) tag to each Persian word in a text, using a set of linguistic rules to select the best tag for every word. This study reports how FarsiTag, a robust tagging system, was designed and implemented for Persian texts. A POS-tagged English–Persian parallel corpus of about 5,000,000 words was also developed as a by-product of the tagger. An experiment was conducted to evaluate the system's performance on unrestricted Persian texts. The highest error rates occur in the medical and religious genres, and the lowest in scientific texts. Across all domains the total error rate is 1.4%, that is, an overall accuracy of 98.6%, which is very promising for a language like Persian.
