Knowledge Acquisition Method for Large-Scale Bilingual Corpus Based on Web Mining

This paper describes a Web-mining method for acquiring multi-word translation equivalents from English-Chinese parallel corpora. To address the multi-word correspondence problem, an N-gram model is adopted to extract candidate translation units. Co-occurrence information is then used to acquire, from a search engine, subject words related to the source proper noun. These subject terms are translated to perform cross-language query expansion, and the expanded query retrieves high-quality bilingual abstract (snippet) resources from the search engine. We also extract candidate translation units, such as compound words and phrases, based on frequency-change information and adjacency information, and make the final selection of proper-noun translations by integrating transliteration features, statistical features, and template features. Experiments show that the proposed translation mining method performs well.
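
As a rough illustration of the N-gram candidate extraction step only, the following minimal Python sketch counts word N-grams over tokenized sentences and keeps those that recur. The function name `extract_candidates`, the parameters `max_n` and `min_freq`, and the toy corpus are assumptions made for this example; they are not taken from the paper, and the sketch omits the adjacency, transliteration, and template features described above.

```python
from collections import Counter


def extract_candidates(sentences, max_n=3, min_freq=2):
    """Count word n-grams of length 2..max_n and keep the frequent ones.

    `sentences` is a list of token lists. The thresholds are illustrative
    assumptions, not values reported in the paper.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(2, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep only n-grams seen at least min_freq times.
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy corpus (hypothetical data, whitespace-tokenized).
corpus = [
    "the search engine returns bilingual pages".split(),
    "a search engine indexes web pages".split(),
    "query the search engine again".split(),
]
candidates = extract_candidates(corpus)
```

On this toy corpus, only N-grams that occur in at least two positions survive the frequency filter, so recurring units such as "search engine" are kept while one-off word pairs are discarded. A fuller pipeline would presumably rescore these candidates with the adjacency and frequency-change information mentioned in the abstract.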