Paper Information - Parallel Sentences Mining From The Web

Parallel sentences can benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are retrieved by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm that extracts candidate bilingual resources and filters out useless bilingual web pages. The sentence pairs contained in the candidate bilingual web pages are verified by a maximum entropy classifier that combines length, word-overlap, alignment, and text-location features. The training sets and the mining seeds are acquired automatically. Experiments show satisfactory parallel resource mining performance.
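The sentence-pair verification step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the feature definitions, the weights, and the function names (`length_ratio`, `word_overlap`, `features`, `maxent_probability`) are all assumptions made for illustration, and the alignment and text-location features named in the abstract are left out. It relies only on the fact that a two-class maximum entropy model reduces to a logistic function of a weighted feature sum.

```python
import math

# Hypothetical toy bilingual lexicon (source word -> possible target words).
# A real system would use a much larger dictionary or learned word alignments.
LEXICON = {
    "我": {"i"},
    "喜欢": {"like", "love"},
    "猫": {"cat", "cats"},
}


def length_ratio(src_tokens, tgt_tokens):
    """Shorter length divided by longer length, in [0, 1]."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    a, b = len(src_tokens), len(tgt_tokens)
    return min(a, b) / max(a, b)


def word_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens that have a lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    target_words = {t.lower() for t in tgt_tokens}
    hits = sum(1 for s in src_tokens if lexicon.get(s, set()) & target_words)
    return hits / len(src_tokens)


def features(src_tokens, tgt_tokens, lexicon=LEXICON):
    """Feature vector: [length ratio, word overlap]."""
    return [
        length_ratio(src_tokens, tgt_tokens),
        word_overlap(src_tokens, tgt_tokens, lexicon),
    ]


def maxent_probability(feats, weights, bias):
    """Binary maximum entropy model: logistic function of the weighted sum."""
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative weights only; in practice they would be fit on training pairs.
WEIGHTS = [2.0, 4.0]
BIAS = -3.0

parallel = features(["我", "喜欢", "猫"], ["I", "like", "cats"])
unrelated = features(["我", "喜欢", "猫"],
                     ["The", "weather", "is", "nice", "today", "."])

print(maxent_probability(parallel, WEIGHTS, BIAS))
print(maxent_probability(unrelated, WEIGHTS, BIAS))
```

With these assumed weights, the translated pair scores above 0.5 and the unrelated pair scores below it, which is the decision a verifier of this kind would make.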

Yanhui Feng | Yu Hong | Jianmin Yao | Zhenxiang Yan
