Paper Information - Saturnalia: A Latin-Catalan Parallel Corpus for Statistical MT

A great effort is currently being devoted to the digitalisation of large historical document collections for preservation purposes. The documents in these collections are usually written in ancient languages, such as Latin or Greek, and this language barrier limits the general public's access to their content. Digital libraries therefore aim not only at storing raw images of digitalised documents, but also at annotating them with their corresponding text transcriptions and translations into modern languages. Unfortunately, few electronic resources are available for ancient languages that natural language processing techniques could exploit. This paper describes the compilation of a novel Latin-Catalan parallel corpus as a new task for statistical machine translation (SMT). Preliminary experimental results obtained with a state-of-the-art phrase-based SMT system are also reported. These results reveal the complexity of the task and its challenging but interesting nature for future development.

Francisco Casacuberta | Alfons Juan-Císcar | Jesús González-Rubio | Jorge Civera
