Paper information - Filtering noisy parallel corpora of web pages

Filtering noisy parallel corpora of web pages

In our previous study, we built PTMiner, an automatic system for mining parallel texts from the Web, which can identify a large number of parallel Web pages for different language pairs. However, the resulting corpus contains a number of non-parallel text pairs. This paper proposes a filtering approach to clean up the corpus. Our experiments show that once the corpus is cleaned, both the translation accuracy of the resulting translation models and the effectiveness of cross-language information retrieval (CLIR) using these models improve significantly.

Jian Cai | Jian-Yun Nie

[1] Jian-Yun Nie, et al. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web, 1999, SIGIR '99.

[2] Kenneth Ward Church, et al. A Program for Aligning Sentences in Bilingual Corpora, 1993, CL.

[3] Robert L. Mercer, et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, 1993, CL.

[4] Michel Simard, et al. Using cognates to align sentences in bilingual corpora, 1993, TMI.

[5] Stanley F. Chen, et al. Aligning Sentences in Bilingual Corpora Using Lexical Information, 1993, ACL.

[6] Dekai Wu, et al. Aligning a Parallel English-Chinese Corpus Statistically With Lexical Criteria, 1994, ACL.

[7] Kui-Lam Kwok, et al. TREC-5 English and Chinese Retrieval Experiments using PIRCS, 1996, TREC.

[8] Jian-Yun Nie, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval, 2000, ANLP.