A Method to Chinese-Vietnamese Bilingual Metallurgy Term Extraction Based on a Pivot Language

To address the scarcity of Chinese-Vietnamese aligned bilingual corpora in the metallurgy domain, this paper proposes a pivot-language method for extracting Chinese-Vietnamese bilingual metallurgy terms. First, term-unit and term-hood features are selected and fed to a trained conditional random field (CRF) model to identify and extract Chinese metallurgy terms. Second, a phrase-based statistical machine translation model is used to generate a Chinese-English phrase table and an English-Vietnamese phrase table; following the pivot-mapping idea, a Chinese-Vietnamese phrase table is then inferred through English as the pivot. Finally, the extracted Chinese metallurgy terms are used to filter the Chinese-Vietnamese phrase table, yielding a Chinese-Vietnamese bilingual metallurgy term base. Experiments show that the proposed method achieves an accuracy of 69.45%, which indicates that it is an effective approach to Chinese-Vietnamese bilingual metallurgy term extraction when no Chinese-Vietnamese aligned corpus is available.
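The pivot-mapping and filtering steps can be illustrated with a minimal sketch. It assumes the common triangulation form, in which the Chinese-to-Vietnamese score is the sum over English pivot phrases of p(en | zh) · p(vi | en); the abstract does not state the exact scoring used, so this is an assumption. The function names, the `top_k` parameter, and the toy phrase tables are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def pivot_phrase_table(zh_en, en_vi):
    """Infer zh -> vi scores by summing over shared English pivot phrases.

    Assumed triangulation: p(vi | zh) = sum_en p(en | zh) * p(vi | en).
    """
    zh_vi = defaultdict(lambda: defaultdict(float))
    for zh, en_probs in zh_en.items():
        for en, p_en_given_zh in en_probs.items():
            # English phrases missing from the en-vi table contribute nothing.
            for vi, p_vi_given_en in en_vi.get(en, {}).items():
                zh_vi[zh][vi] += p_en_given_zh * p_vi_given_en
    return {zh: dict(vis) for zh, vis in zh_vi.items()}


def filter_by_terms(zh_vi, zh_terms, top_k=1):
    """Keep only entries whose Chinese side is an extracted term."""
    term_base = {}
    for zh in zh_terms:
        candidates = zh_vi.get(zh)
        if not candidates:
            continue
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        term_base[zh] = ranked[:top_k]
    return term_base


# Toy phrase tables with made-up probabilities (illustration only).
zh_en = {
    "钢": {"steel": 0.9, "iron": 0.1},
    "高炉": {"blast furnace": 1.0},
    "今天": {"today": 1.0},  # general word, not a metallurgy term
}
en_vi = {
    "steel": {"thép": 0.8, "sắt": 0.2},
    "iron": {"sắt": 1.0},
    "blast furnace": {"lò cao": 1.0},
    "today": {"hôm nay": 1.0},
}

# Stand-in for the Chinese terms a CRF extractor would return.
zh_terms = ["钢", "高炉"]

table = pivot_phrase_table(zh_en, en_vi)
term_base = filter_by_terms(table, zh_terms)
for zh, candidates in term_base.items():
    print(zh, candidates)
```

In this sketch the general word 今天 appears in the inferred phrase table but is dropped by the filter, because it is not in the extracted term list. A real pipeline would typically also prune low-scoring pivot paths, since triangulation tends to multiply noisy entries.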
