Automatic generation of Japanese–English bilingual thesauri based on bilingual corpora

The authors propose a method for automatically generating Japanese–English bilingual thesauri from bilingual corpora. Here, a bilingual thesaurus is a set of bilingual equivalent words together with their synonyms. Most methods proposed so far for extracting clusters of bilingual equivalents from bilingual corpora depend heavily on word frequency and handle low-frequency clusters poorly. Low-frequency bilingual clusters are worth extracting because they contain many newly coined terms that are in demand but are not listed in existing bilingual thesauri. On the assumption that language-pair-independent methods used alone, such as frequency-based ones, have reached their limits, and that a language-pair-dependent method used in combination with other methods shows promise, the authors propose the following approach: (a) extract translation pairs based on transliteration patterns; (b) remove the words in those pairs from the candidate words; (c) extract translation pairs from the remaining candidate words based on word frequency; and (d) generate bilingual clusters from the extracted pairs using a graph-theoretic method. The proposed method has been found to be significantly more effective than other methods. © 2006 Wiley Periodicals, Inc.
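
The four steps (a) to (d) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the transliteration patterns, the frequency measure, or the graph-theoretic method. The sketch therefore assumes a tiny hand-written katakana table (`KANA`), string similarity from `difflib.SequenceMatcher` as a stand-in for transliteration matching, the Dice coefficient over aligned segments as the frequency-based score, and connected components as the clustering step. All function names, thresholds, and the demo corpus are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

# Hypothetical toy katakana table; it covers only the demo words below.
KANA = {"デ": "de", "タ": "ta", "ノ": "no", "ド": "do", "ー": ""}

def is_katakana(word):
    # True if every character lies in the Unicode Katakana block.
    return all("\u30a0" <= ch <= "\u30ff" for ch in word)

def romanize(word):
    # Characters missing from the toy table are silently dropped.
    return "".join(KANA.get(ch, "") for ch in word)

def transliteration_pairs(corpus, threshold=0.7):
    """Step (a): pair katakana words with similar-looking English words."""
    pairs = set()
    for ja_words, en_words in corpus:
        for ja in filter(is_katakana, ja_words):
            for en in en_words:
                if SequenceMatcher(None, romanize(ja), en).ratio() >= threshold:
                    pairs.add((ja, en))
    return pairs

def frequency_pairs(corpus, threshold=0.6):
    """Step (c): pair words with a high Dice coefficient over aligned segments."""
    ja_freq, en_freq, cooc = Counter(), Counter(), Counter()
    for ja_words, en_words in corpus:
        ja_set, en_set = set(ja_words), set(en_words)
        ja_freq.update(ja_set)
        en_freq.update(en_set)
        cooc.update((ja, en) for ja in ja_set for en in en_set)
    return {(ja, en) for (ja, en), n in cooc.items()
            if 2 * n / (ja_freq[ja] + en_freq[en]) >= threshold}

def clusters(pairs):
    """Step (d): connected components of the bipartite translation-pair graph."""
    graph = defaultdict(set)
    for ja, en in pairs:
        graph[("ja", ja)].add(("en", en))
        graph[("en", en)].add(("ja", ja))
    seen, result = set(), []
    for start in list(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        result.append((sorted(w for side, w in component if side == "ja"),
                       sorted(w for side, w in component if side == "en")))
    return sorted(result)

def build_thesaurus(corpus):
    translit = transliteration_pairs(corpus)
    used_ja = {ja for ja, _ in translit}
    used_en = {en for _, en in translit}
    # Step (b): remove already-paired words from the candidate words.
    remaining = [([w for w in ja if w not in used_ja],
                  [w for w in en if w not in used_en]) for ja, en in corpus]
    return clusters(translit | frequency_pairs(remaining))

# Hypothetical aligned segments: (Japanese tokens, English tokens).
DEMO_CORPUS = [
    (["データ", "頂点"], ["data", "vertex"]),
    (["ノード", "辺"], ["node", "edge"]),
    (["節点"], ["vertex"]),
    (["頂点", "辺"], ["vertex", "edge"]),
    (["節点"], ["vertex"]),
]

if __name__ == "__main__":
    for ja_words, en_words in build_thesaurus(DEMO_CORPUS):
        print(ja_words, en_words)
```

On this toy corpus, the frequency step links both 頂点 and 節点 to "vertex", so the clustering step places the two Japanese words in one cluster as synonyms. A real corpus would need a full transliteration model and tuned thresholds in place of the toy table and the fixed cutoffs used here.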
