Paper Information - Mining English/Chinese Parallel Documents from the World Wide Web

Mining English/Chinese Parallel Documents from the World Wide Web

The amount of information on the World Wide Web in languages other than English is increasing significantly. Dictionaries are the most typical tools for crossing language boundaries. However, general-purpose dictionaries are less sensitive to genre and domain, and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not share these limitations, provide a statistical translation model for crossing the language boundary. The objective of this research is to mine English/Chinese parallel documents automatically from the World Wide Web. In this paper, we present an alignment method based on dynamic programming to identify one-to-one Chinese and English title pairs. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. A score function is then proposed to determine the optimal title pairs. Experiments were conducted to evaluate the proposed method; the precision is 0.995 and the recall is 0.8096.
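The abstract names the longest common subsequence as the tool for choosing a Chinese translation, but it does not give the paper's exact procedure or score function. The following is only a minimal sketch of the textbook LCS dynamic program, assuming that candidate translations from a dictionary are compared character by character against a Chinese title. The helper `best_translation` and the sample strings are hypothetical illustrations, not taken from the paper.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def best_translation(candidates, chinese_title):
    """Hypothetical helper: pick the candidate whose LCS with the title is longest.

    Ties are resolved in favour of the earliest candidate in the list.
    """
    return max(candidates, key=lambda c: len(lcs(c, chinese_title)))


if __name__ == "__main__":
    # Assumed dictionary candidates for one English word, and an assumed title.
    print(best_translation(["資料", "信息", "資訊"], "資訊科技署"))
```

Ranking by raw LCS length is the simplest possible choice; a real system would likely normalise by candidate length and combine the result with other evidence, as the score function mentioned in the abstract suggests.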

Christopher C. Yang | Kar Wing Li
