Filtering noisy parallel corpora of web pages

In a previous study, we built PTMiner, an automatic system for mining parallel texts from the Web, which is able to identify a large number of parallel Web pages for different language pairs. However, the resulting corpus also contains a number of non-parallel text pairs. This paper proposes a filtering approach to clean up the corpus. Our experiments show that once the corpus is cleaned, both the accuracy of the translation models trained on it and the effectiveness of cross-language information retrieval (CLIR) using these models improve significantly.