Mining Bilingual Data from the Web with Adaptively Learnt Patterns

Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross-language information retrieval. Based on the observation that bilingual data in many web pages appear collectively and follow similar patterns, this paper proposes an adaptive pattern-based bilingual data mining method. Given a web page, the method proceeds in four steps: 1) preprocessing: parse the web page into a DOM tree and segment the inner text of each node into snippets; 2) seed mining: identify potential translation pairs (seeds) using a word-based alignment model that takes both translation and transliteration into account; 3) pattern learning: learn generalized patterns from the identified seeds; 4) pattern-based mining: extract all bilingual data in the page using the learned patterns. Our experiments on Chinese web pages produced more than 7.5 million pairs of bilingual sentences and more than 5 million pairs of bilingual terms, both with over 80% accuracy.
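The four steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: a tiny hypothetical lexicon (`TOY_LEXICON`) stands in for the word-based alignment model, each text node is treated as a single snippet, and a "pattern" is reduced to the literal text surrounding a Chinese-then-English pair. All function and variable names are assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical toy lexicon; the paper uses a word-based alignment model
# covering translation and transliteration instead.
TOY_LEXICON = {"你好": "hello", "谢谢": "thank you"}

CJK = r"[\u4e00-\u9fff]+"            # a run of Chinese characters
LATIN = r"[A-Za-z]+(?: [A-Za-z]+)*"  # one or more space-separated English words


class TextCollector(HTMLParser):
    """Step 1 (preprocessing): collect each non-empty text node as a snippet."""

    def __init__(self):
        super().__init__()
        self.snippets = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.snippets.append(text)


def find_seeds(snippets):
    """Step 2 (seed mining): keep snippets that contain a lexicon-confirmed pair."""
    seeds = []
    for snippet in snippets:
        for zh, en in TOY_LEXICON.items():
            match = re.search(re.escape(en), snippet, re.IGNORECASE)
            if zh in snippet and match:
                seeds.append((snippet, zh, match.span()))
    return seeds


def learn_pattern(seeds):
    """Step 3 (pattern learning): generalize seeds into the most frequent layout."""
    votes = Counter()
    for snippet, zh, (en_start, en_end) in seeds:
        zh_start = snippet.index(zh)
        zh_end = zh_start + len(zh)
        if zh_end > en_start:
            continue  # this sketch only handles "Chinese ... English" order
        layout = (snippet[:zh_start], snippet[zh_end:en_start], snippet[en_end:])
        votes[layout] += 1
    if not votes:
        return None
    (prefix, middle, suffix), _ = votes.most_common(1)[0]
    return re.compile(
        re.escape(prefix) + "(" + CJK + ")"
        + re.escape(middle) + "(" + LATIN + ")"
        + re.escape(suffix)
    )


def mine_page(html):
    """Step 4 (pattern-based mining): apply the learned pattern to every snippet."""
    collector = TextCollector()
    collector.feed(html)
    pattern = learn_pattern(find_seeds(collector.snippets))
    if pattern is None:
        return []
    pairs = []
    for snippet in collector.snippets:
        pairs.extend(pattern.findall(snippet))
    return pairs


PAGE = """
<ul>
  <li>你好 (hello)</li>
  <li>谢谢 (thank you)</li>
  <li>苹果 (apple)</li>
  <li>图书馆 (library)</li>
</ul>
"""

if __name__ == "__main__":
    for zh, en in mine_page(PAGE):
        print(zh, "<->", en)
```

In this toy page only the first two list items are confirmed by the lexicon, yet the pairs for 苹果 and 图书馆 are also extracted because they follow the same learned layout. That is the point of learning patterns per page: a few reliable seeds are enough to recover pairs the seed model alone would miss.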
