Resource Report: Building Parallel Text Corpora for Multi-Domain Translation System

Parallel text is one of the most valuable resources for developing statistical machine translation systems and other NLP applications. However, manual translation is very costly, and the amount of available parallel text is limited. Hence, our research started with creating and collecting a large amount of parallel text for Indonesian-English. In this paper we describe the creation of three parallel corpora: ANTARA News, BPPT-PANL and BTEC-ATR. To be useful for statistical approaches to language processing, these resources must be available in reasonable quantity and quality. We describe the problems encountered and their solutions, as well as the robust tools and annotation schema used to build and process these corpora.
