Web Translation Mining Based on Suffix Arrays

Mining translations from the abundant data on the Web is useful in many fields, such as computer-assisted learning, machine translation, and cross-language information retrieval. The challenging issues are how to mine possible translations from the Web and determine the boundaries of the candidates, and how to remove irrelevant noise and rank the candidates. In this paper, after reviewing and analyzing the possible methods of acquiring translations, we propose a statistical method based on suffix arrays to mine term translations from the Web. The proposed method can not only mine the different forms in which translations are distributed on the Web but also effectively obtain the correct translation boundaries. Sort-based subset deletion and mutual information methods are then proposed to deal, respectively, with the subset redundancy and the affix redundancy that arise during estimation. Experiments on two test sets, one of 401 English-Chinese terms and one of 100 English-Japanese terms, indicate that the system performs well.
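The abstract names the techniques but does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the Web snippets are already collected as plain strings, builds a naive suffix array over them, counts repeated substrings as translation candidates, and then drops any candidate that is contained in a longer candidate with the same frequency (one plausible reading of sort-based subset deletion). The function names, the separator, and the length and frequency thresholds are all hypothetical, and the mutual information step is not covered.

```python
from itertools import groupby

SEP = "\n"  # assumed separator; it must not occur inside a snippet


def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate for short snippets only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_repeated_substrings(snippets, min_len=2, max_len=8, min_freq=2):
    """Return {substring: frequency} for substrings repeated across the snippets."""
    text = SEP.join(snippets)
    sa = build_suffix_array(text)
    counts = {}
    for length in range(min_len, max_len + 1):
        # In suffix-array order, suffixes sharing a prefix are contiguous,
        # so each run of equal prefixes gives that prefix's frequency.
        prefixes = (text[i:i + length] for i in sa if i + length <= len(text))
        for prefix, run in groupby(prefixes):
            freq = sum(1 for _ in run)
            if freq >= min_freq and SEP not in prefix:
                counts[prefix] = freq
    return counts


def delete_subsets(counts):
    """Drop candidates contained in a longer kept candidate of equal frequency."""
    kept = {}
    # Visit longer candidates first so shorter ones are checked against them.
    for cand in sorted(counts, key=len, reverse=True):
        redundant = any(
            cand in longer and counts[longer] == counts[cand] for longer in kept
        )
        if not redundant:
            kept[cand] = counts[cand]
    return kept


if __name__ == "__main__":
    snippets = ["xabcy", "zabcw", "qabcr"]  # toy stand-ins for Web snippets
    candidates = count_repeated_substrings(snippets)
    print(delete_subsets(candidates))
```

In this toy input, "ab", "bc", and "abc" each occur three times; the two shorter strings are removed because they always appear inside "abc", which leaves the candidate with the correct boundary.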
