Paper information - Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus

We introduce the Contemporary Amharic Corpus, which is automatically tagged for morphosyntactic information. The texts were collected from 25,199 documents across different domains, and about 24 million orthographic words were tokenized. Because it is partly a web corpus, we applied some automatic spelling error corrections. We also modified the existing morphological analyzer, HornMorpho, so that it could be used for automatic tagging.

Michael Gasser | Andreas Nürnberger | Binyam Ephrem Seyoum | Andargachew Mekonnen Gezmu
