Parallel corpora are essential resources for constructing bilingual term dictionaries of historical classics. To obtain large-scale parallel corpora, this paper proposes a sentence alignment method based on mode prediction and term translation pairs. On the one hand, the method restructures the sentence alignment process according to the characteristics of translations of historical classics and incorporates mode prediction into alignment. On the other hand, because no bilingual dictionary of ancient Chinese is available, the method performs alignment using term translation pairs extracted from manually aligned sentence pairs. The method first predicts the probability of each alignment mode from the number of characters, the number of punctuation marks, and certain characters in the Chinese sentence; it then aligns sentences using the length alignment probability, the term alignment probability, and the mode probability. In addition, the method selects anchor sentence pairs based on sentence length and predicted mode to prevent alignment errors from propagating. Experiments on "Shi Ji" show that mode prediction and term translation pairs both clearly improve sentence alignment performance.
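To make the combination of the three probabilities concrete, the following is a minimal sketch of a dynamic-programming aligner that adds a length score, a term score, and a mode log-probability for each candidate bead. It is not the paper's implementation: the set of modes, the Gaussian-style length score, the substring-based term matching, the fallback mode prior, and all names and parameter values (`MODES`, `DEFAULT_MODE_PROBS`, `ratio`, `var`, `term_weight`) are assumptions chosen for illustration. Anchor selection is not covered.

```python
import math

# Assumed set of alignment modes: (source sentences, target sentences) per bead.
MODES = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]
# Assumed fallback prior, used when no per-sentence mode prediction is given.
DEFAULT_MODE_PROBS = {(1, 1): 0.8, (1, 2): 0.07, (2, 1): 0.07,
                      (1, 0): 0.03, (0, 1): 0.03}

def length_logprob(src_len, tgt_len, ratio=5.0, var=6.8):
    """Gaussian-style log score; ratio is an assumed number of English
    characters per Chinese character, var an assumed variance constant."""
    scaled = tgt_len / ratio
    mean = max((src_len + scaled) / 2.0, 1.0)
    z = (scaled - src_len) / math.sqrt(var * mean)
    return -0.5 * z * z

def term_score(src_text, tgt_text, term_pairs):
    """Count term translation pairs found on both sides (naive substring match)."""
    return sum(1 for zh, en in term_pairs if zh in src_text and en in tgt_text)

def align(src, tgt, term_pairs, mode_probs=(), term_weight=1.0):
    """Return beads as (source indices, target indices) with the best total score.
    mode_probs[k] is a hypothetical predicted mode distribution for source
    sentence k; missing entries fall back to DEFAULT_MODE_PROBS."""
    n, m = len(src), len(tgt)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in MODES:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or best[pi][pj] == neg_inf:
                    continue
                src_chunk, tgt_chunk = src[pi:i], tgt[pj:j]
                probs = mode_probs[pi] if pi < len(mode_probs) else DEFAULT_MODE_PROBS
                score = (best[pi][pj]
                         + length_logprob(sum(len(s) for s in src_chunk),
                                          sum(len(t) for t in tgt_chunk))
                         + term_weight * term_score("".join(src_chunk),
                                                    " ".join(tgt_chunk), term_pairs)
                         + math.log(probs.get((di, dj), 1e-6)))
                if score > best[i][j]:
                    best[i][j], back[i][j] = score, (di, dj)
    # Follow back-pointers from the end to recover the chosen beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]

if __name__ == "__main__":
    # Toy sentences invented for illustration; not taken from the paper's corpus.
    zh = ["高祖沛人也", "姓刘氏"]
    en = ["Gaozu was a man of Pei.", "His family name was Liu."]
    print(align(zh, en, [("高祖", "Gaozu"), ("刘", "Liu")]))
```

Summing the three terms in log space keeps the search a standard shortest-path style dynamic program, so a per-sentence mode prediction only changes the prior term of each bead and leaves the search itself unchanged.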