Paper information - Building an English-Vietnamese Bilingual Corpus for Machine Translation

Bilingual corpora are critical resources for machine translation research and development because parallel corpora contain translation equivalences at various granularities. Manually annotated word alignments provide a gold standard for developing and evaluating both example-based and statistical machine translation models. This paper presents research on building an English-Vietnamese parallel corpus, constructed to support a Vietnamese-English machine translation system. We describe the specification for collecting the corpus data, the linguistic tagging, the bilingual annotation, and the tools developed specifically for manual annotation. An English-Vietnamese bilingual corpus of over 800,000 sentence pairs and 10,000,000 English words, as well as Vietnamese words, has been collected and aligned at the sentence level, and over 45,000 sentence pairs of this corpus have been aligned at the word level.

Werner Winiwarter | Hung Quoc Ngo
