The SAWA Corpus: A Parallel Corpus English - Swahili

Research in data-driven methods for Machine Translation has greatly benefited from the increasing availability of parallel corpora. Processing the same text in two different languages yields useful information on how words and phrases are translated from a source language into a target language. To extract this information, a parallel corpus is typically aligned by linking linguistic tokens in the source language to the corresponding units in the target language. An aligned parallel corpus therefore facilitates the automatic development of a machine translation system and can also be used to bootstrap annotation through projection. In this paper, we describe our data collection and annotation efforts, along with preliminary experimental results, for an English-Swahili parallel corpus.
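To make the idea of annotation projection concrete, the following is a minimal sketch and not the procedure used for the SAWA corpus. It assumes a sentence pair that has already been word-aligned and copies part-of-speech tags from the English tokens to the Swahili tokens they are linked to. The function name `project_tags`, the tag labels, and the hand-written alignment are all hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy source-side tags onto target tokens along alignment links.

    `alignment` holds (source_index, target_index) pairs. Target tokens
    with no link stay None; if several source tokens link to the same
    target token, the first link wins (a simplifying assumption).
    """
    projected: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        if projected[tgt_idx] is None:
            projected[tgt_idx] = source_tags[src_idx]
    return projected


# Toy sentence pair with a hand-made alignment (illustrative only).
english = ["the", "child", "reads", "a", "book"]
english_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
swahili = ["mtoto", "anasoma", "kitabu"]

# Swahili has no articles, so "the" and "a" are left unaligned.
links = [(1, 0), (2, 1), (4, 2)]

projected = project_tags(english_tags, links, len(swahili))
for token, tag in zip(swahili, projected):
    print(token, tag)
```

In practice the alignment links would come from an automatic word aligner rather than being written by hand, and projected tags would need filtering, since alignment errors and structural differences between the two languages carry over into the projected annotation.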
