Cross-level Sentence Alignment

This paper describes a new model for sentence alignment between structurally different languages such as Chinese and Portuguese. An alignment may be one-to-one, one-to-many, many-to-one or many-to-many. It is also not surprising for the first word or sentence of the Chinese text to be a translation of the last word or sentence of the Portuguese text. The proposed method combines the statistical approach and the lexical approach in order to achieve both efficiency and accuracy. In our current research, we first complete the word-level alignment, using a Chinese-Portuguese dictionary to obtain a basic translation rate between the two texts. However, the system cannot make good word alignment decisions from the dictionary information alone, because named entities are usually missing from the dictionary, which leads the system to wrong decisions. To make the system more adaptive, we apply a maximum entropy model to align the named entities without performing word segmentation on the Chinese text. Second, from the word-level alignment we obtain anchor points and then process the sentence-level alignment. We use the Hidden Markov Model (HMM) and Singular Value Decomposition (SVD) to obtain statistical information about the sentences. For the SVD model, we first set up a matrix consisting of the word-level alignment statistics, and then perform a two-dimensional reconstruction of the original matrix. By comparing Figure 1 and Figure 2 (sample fragments of data), we can observe some relationships among the sentences, which gives an approximation of the sentence alignment.
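The dictionary-based first step can be illustrated with a minimal sketch. It is not the system described above, only one way such a translation-rate matrix and its anchor points might be computed. The toy dictionary, the substring lookup on unsegmented Chinese text, the 0.5 threshold and all function names are assumptions made for illustration; the maximum entropy, HMM and SVD components are not covered.

```python
# Hypothetical toy dictionary: Chinese entry -> set of Portuguese translations.
TOY_DICT = {
    "语言": {"língua", "idioma"},
    "句子": {"frase"},
    "翻译": {"tradução", "traduzir"},
    "词典": {"dicionário"},
}


def match_rate(zh_sentence, pt_sentence, dictionary):
    """Share of dictionary entries found in zh_sentence that have at least
    one translation among the tokens of pt_sentence (assumed definition)."""
    # Crude tokenisation of the Portuguese side: lowercase, drop . and ,
    cleaned = pt_sentence.lower().replace(".", " ").replace(",", " ")
    pt_tokens = set(cleaned.split())
    # Substring lookup, so the Chinese side needs no word segmentation.
    found = [zh for zh in dictionary if zh in zh_sentence]
    if not found:
        return 0.0
    hits = sum(1 for zh in found if dictionary[zh] & pt_tokens)
    return hits / len(found)


def rate_matrix(zh_sentences, pt_sentences, dictionary):
    """Rows are Chinese sentences, columns are Portuguese sentences."""
    return [[match_rate(zh, pt, dictionary) for pt in pt_sentences]
            for zh in zh_sentences]


def anchor_points(matrix, threshold=0.5):
    """Cells that reach the threshold and are the unique maximum of both
    their row and their column (an assumed anchor criterion)."""
    anchors = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            column = [r[j] for r in matrix]
            if (value >= threshold
                    and value == max(row) and row.count(value) == 1
                    and value == max(column) and column.count(value) == 1):
                anchors.append((i, j))
    return anchors


# Two sentences whose order is reversed between the texts.
zh_text = ["这本词典很好。", "翻译这个句子。"]
pt_text = ["Traduzir esta frase.", "Este dicionário é bom."]

matrix = rate_matrix(zh_text, pt_text, TOY_DICT)
print(matrix)                 # [[0.0, 1.0], [1.0, 0.0]]
print(anchor_points(matrix))  # [(0, 1), (1, 0)]
```

In this toy case the anchors cross, with the first Chinese sentence matched to the last Portuguese one, which is the kind of reordering mentioned above. A sentence containing only named entities absent from the dictionary would receive a rate of 0.0 everywhere, which shows why dictionary information alone is not enough.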