Extracting bilingual collocations from non-aligned parallel corpora

This paper proposes a new method for finding correspondences of uninterrupted collocations in Japanese-English bilingual corpora without sentence-to-sentence alignment. Uninterrupted English collocations such as “once again”, “give up”, or “gross national product”, which are handled as a single word or a compound word in Japanese, can be automatically extracted together with their corresponding Japanese words using word co-occurrence frequencies in both corpora. The method consists of two stages. First, English and Japanese collocations are extracted separately from the given corpora. Successive word units, which become collocation candidates, are collected using n-gram statistics for each word; two entropy values, after-unit and before-unit, are then calculated for each unit, and units surpassing the thresholds are selected as uninterrupted collocations. Second, a corresponding translation of each uninterrupted English collocation is extracted from the Japanese corpus by calculating correlation values between the target collocation and the Japanese words or collocations that co-occur with it in the given corpora, with the help of a basic English-to-Japanese word unit dictionary. Experiments were carried out on economic articles from the Asahi Newspaper. A Japanese word unit for each extracted English collocation was obtained automatically with a precision of more than 70%, compared with about 40% when only word-to-word correspondence was used.
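The first stage can be illustrated with a minimal sketch. It assumes that the after-unit and before-unit entropies are the Shannon entropies of the words observed immediately after and immediately before a candidate unit, and that a unit is kept only when both values reach a threshold. The function names, the fixed n-gram length, and the threshold value are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def ngram_neighbors(tokens, n):
    """Count, for every n-gram, the words seen directly before and after it."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        unit = tuple(tokens[i:i + n])
        if i > 0:
            before[unit][tokens[i - 1]] += 1
        if i + n < len(tokens):
            after[unit][tokens[i + n]] += 1
    return before, after


def entropy(counter):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def select_units(tokens, n, threshold):
    """Keep n-grams whose before-unit and after-unit entropies both reach
    the threshold (an assumed selection rule, not taken from the paper)."""
    before, after = ngram_neighbors(tokens, n)
    units = set(before) | set(after)
    return sorted(
        u for u in units
        if entropy(before.get(u, Counter())) >= threshold
        and entropy(after.get(u, Counter())) >= threshold
    )


# Toy corpus: "give up" appears in three different contexts.
tokens = "we give up now they give up then you give up again i give in".split()
candidates = select_units(tokens, n=2, threshold=1.0)
```

In this toy corpus, "give up" is preceded and followed by three different words, so both entropies are high, whereas a fragment such as "up now" has a single neighbor on each side and is filtered out. High entropy on both sides suggests that the unit boundary falls at a natural break, which is the intuition behind using the two values as a filter.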
