Tagging and Verifying an Amharic News Corpus

This paper describes work on verifying, correcting and retagging a corpus of Amharic news texts. A total of 8715 Amharic news articles had previously been collected from a website, and part of the corpus (1065 articles; 210,000 words) was then morphologically analysed and manually part-of-speech tagged. The tagged corpus has been used as the basis for testing how well machine learning techniques and tools developed for other languages apply to Amharic. This process exposed several errors and inconsistencies in the corpus, which has since been iteratively refined, cleaned, normalised, split into folds, and partially re-tagged by both automatic and manual means.
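As a rough illustration of two of the steps mentioned above (splitting a tagged corpus into folds and spotting possible tagging inconsistencies), the following minimal Python sketch may help. It is not the authors' tooling: the function names, the placeholder tokens and tags, the number of folds, and the representation of a sentence as a list of (token, tag) pairs are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_into_folds(sentences, k=10, seed=0):
    """Shuffle the sentences and deal them round-robin into k folds.

    The value of k is an assumption; the abstract does not state it.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def tokens_with_multiple_tags(sentences):
    """Return each token that occurs with more than one tag, with its tags.

    Such tokens are only candidates for manual review: a word can be
    genuinely ambiguous, so several tags do not by themselves mean an error.
    """
    tags_seen = defaultdict(set)
    for sentence in sentences:
        for token, tag in sentence:
            tags_seen[token].add(tag)
    return {tok: tags for tok, tags in tags_seen.items() if len(tags) > 1}


# Hypothetical toy corpus; tokens and tags are placeholders, not real data.
corpus = [
    [("w1", "N"), ("w2", "V")],
    [("w1", "ADJ"), ("w3", "N")],
    [("w2", "V")],
    [("w3", "N"), ("w4", "PREP")],
]

folds = split_into_folds(corpus, k=2)
suspects = tokens_with_multiple_tags(corpus)
```

Fixing the random seed keeps the fold assignment reproducible, so that different taggers can be compared on identical train and test partitions.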
