Bilingual Sentence Alignment Based on Punctuation Marks

We present a new approach to aligning English and Chinese sentences in parallel corpora based solely on punctuation marks. Although the length-based approach achieves high sentence alignment accuracy for clean parallel corpora written in two Western languages, such as French-English and German-English, it does not fare as well for parallel corpora that are noisy or written in two distant languages, such as Chinese-English. Cognates can be used on top of the length-based approach to increase alignment accuracy. However, cognates do not exist between two distant languages, which limits the applicability of the cognate-based approach. In this paper, we examine the feasibility of using punctuation marks for high-accuracy sentence alignment. We experimented with an implementation of the proposed method on the Chinese-English Sinorama Magazine Corpus, with satisfactory results. We also demonstrate that the method is applicable to other language pairs, such as English-Japanese, with minimal additional effort.