Automatic extraction of bilingual word pairs from parallel corpora with various languages using learning for adjacent information

This paper presents a learning method that uses adjacent information to extract bilingual word pairs efficiently from parallel corpora in various languages for which language resources are insufficient. In our method, information about the correspondence between source language words and target language words is acquired automatically from the word strings that adjoin bilingual word pairs. The acquired information is used to resolve ambiguity in the correspondence between source language words and target language words in bilingual sentence pairs. First, the system using our method automatically acquires templates, which indicate the correspondence between source language words and target language words and are based on the word strings that adjoin bilingual word pairs. The system then uses the acquired templates to extract bilingual word pairs efficiently from bilingual sentence pairs. In evaluation experiments, the system extracted bilingual word pairs from parallel corpora covering five languages. The total extraction rate was 60.1%, which is 8.0 percentage points higher than that of a system based only on the Dice coefficient without our method. These results confirm the effectiveness of our method. © 2006 Wiley Periodicals, Inc. Syst Comp Jpn, 37(13): 40–53, 2006; Published online in Wiley InterScience. DOI 10.1002/scj.20534
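For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of Dice-coefficient scoring over sentence-level co-occurrence. It is not the authors' system: the function name `dice_scores`, the whitespace tokenization, and the toy corpus are assumptions made for illustration, and the formula used is the standard Dice coefficient, 2 × (co-occurrence count) / (source word count + target word count), counted per sentence pair.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Score every co-occurring (source, target) word pair with the Dice coefficient.

    Counts are per sentence pair: a word is counted once for each sentence
    pair it appears in, no matter how often it repeats inside that sentence.
    """
    src_count = Counter()   # sentence pairs containing each source word
    tgt_count = Counter()   # sentence pairs containing each target word
    pair_count = Counter()  # sentence pairs containing both words

    for src_sentence, tgt_sentence in sentence_pairs:
        # Assumption: whitespace tokenization is good enough for the sketch.
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


# Toy English / romanized Japanese corpus, invented for illustration only.
corpus = [
    ("i like tea", "watashi wa ocha ga suki"),
    ("i like coffee", "watashi wa koohii ga suki"),
    ("tea is hot", "ocha wa atsui"),
]
scores = dice_scores(corpus)
```

In this toy corpus, `("i", "watashi")` and `("i", "suki")` receive the same score, because both target words appear in exactly the sentences that contain "i". Co-occurrence statistics alone cannot separate such candidates, which appears to be the kind of correspondence ambiguity that the templates built from adjacent word strings are meant to resolve.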