Parallel corpora are among the most important resources for research in areas such as machine translation, bilingual lexicography, and linguistics. This paper describes the construction of a large-scale English-Persian parallel corpus of about 400,000 sentence pairs, called the Tehran Parallel Corpus (TPC). The aim of the study is to introduce the structure of TPC and to describe the materials used to build it. In addition, several useful tools developed within the project are introduced, and three types of statistical machine translation (SMT) systems trained on TPC are examined. To obtain a high-quality parallel corpus, uncertain alignments identified by a MaxEnt classifier were removed from the corpus. As an intrinsic evaluation, 1,600 sentence pairs were randomly sampled and manually compared with a gold-standard test set. As an extrinsic evaluation, three phrase-based SMT systems trained on TPC were built. The results show that our translation systems outperform the Google English-to-Persian translation system in terms of the BLEU and TER metrics.
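The filtering step mentioned above can be illustrated with a minimal sketch. It is not the classifier used for TPC: the single length-ratio feature, the toy training pairs (placeholder tokens rather than real English or Persian text), the 0.5 threshold, and all function names are assumptions made only for illustration. A binary MaxEnt model is equivalent to logistic regression, which is what the sketch fits by gradient ascent.

```python
import math

def features(src, tgt):
    """Hypothetical feature vector for a sentence pair (not the paper's feature set)."""
    ls, lt = len(src.split()), len(tgt.split())
    ratio = min(ls, lt) / max(ls, lt, 1)  # token-length ratio in [0, 1]
    return [1.0, ratio]                   # bias term + length ratio

def prob_parallel(w, x):
    """Binary MaxEnt (logistic) model: P(pair is a correct alignment | x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=2000, lr=0.5):
    """Fit the weights by batch gradient ascent on the log-likelihood."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x, y in data:
            err = y - prob_parallel(w, x)
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def filter_pairs(pairs, w, threshold=0.5):
    """Keep only pairs the model scores at or above the (assumed) threshold."""
    return [(s, t) for s, t in pairs
            if prob_parallel(w, features(s, t)) >= threshold]

# Toy labelled data: 1 = good alignment, 0 = noisy alignment (illustrative only).
labelled = [
    (("a b c d", "w x y z"), 1),
    (("a b c d e", "w x y z"), 1),
    (("a b c", "w x y z"), 1),
    (("a b c d e", "w"), 0),
    (("a", "w x y z"), 0),
    (("a b c d e f g h i j", "w"), 0),
]
w = train([(features(s, t), y) for (s, t), y in labelled])

candidates = [("a b c d", "w x y z"), ("a b c d e f g h i j", "z")]
kept = filter_pairs(candidates, w)
```

In this toy setting the first candidate, whose two sides have the same length, is kept, and the second, with a 10:1 length mismatch, is discarded. A real filter would rely on richer evidence, such as lexical translation scores, than length alone.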