E-54 Aligning for SMT: Results from Real World Corpora

Successes in the field of Statistical Machine Translation (Brown et al. 1993), coupled with the recent availability of public domain tools such as EGYPT (Al-Onaizan et al. 2000), have contributed to an upsurge of interest in corpus-based MT. One factor which still inhibits the expansion of SMT research is the scarcity of bilingual corpora. Complexity constraints on training and decoding also require that maximum sentence length be limited. At ATR we are working with "real-world" corpora of content-aligned news articles (Tanaka et al. 2002). Each corpus needs aligning at a phrasal level to be useful in training SMT models. This paper reports preliminary results of that alignment effort. It is structured as follows: Section 2 introduces the corpus, Section 3 summarizes the alignment method, and Section 4 discusses the results. The conclusion looks forward to methods which can find statistical regularity in real-world corpora.