Mining English/Chinese Parallel Documents from the World Wide Web

The amount of information on the World Wide Web in languages other than English is increasing significantly. Dictionaries are the most typical tools for crossing language boundaries. However, general-purpose dictionaries are less sensitive to genre and domain, and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not share these limitations, provide a statistical translation model for crossing the language boundary. The objective of this research is to mine English/Chinese parallel documents automatically from the World Wide Web. In this paper, we present an alignment method based on dynamic programming that identifies one-to-one Chinese and English title pairs. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word, and a score function is then proposed to determine the optimal title pairs. Experiments conducted to evaluate the proposed method yield a precision of 0.995 and a recall of 0.8096.
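
To make the role of the LCS concrete, the following is a minimal sketch of the standard dynamic-programming computation of LCS length, followed by one possible way of using it to choose among candidate translations. The abstract does not specify how the LCS is applied, so the function best_translation, its normalization by candidate length, and the example strings are illustrative assumptions and not the method evaluated in this paper.

```python
from typing import Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_translation(candidates: Sequence[str], chinese_title: str) -> str:
    """Hypothetical selection rule (an assumption, not the paper's score
    function): prefer the candidate whose characters are best covered,
    in order, by the Chinese title."""
    def coverage(candidate: str) -> float:
        if not candidate:
            return 0.0
        return lcs_length(candidate, chinese_title) / len(candidate)

    return max(candidates, key=coverage)
```

For example, given the dictionary candidates "电脑" and "计算机" for the English word "computer" and the Chinese title "计算机科学导论", this rule would select "计算机", since all of its characters appear in order in the title.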