A Statistical Model for Automatic Extraction of Korean Transliterated Foreign Words

In this paper, we describe an algorithm for extracting transliterated foreign words from Korean text. We reformulate foreign word extraction as a syllable-tagging problem in which each syllable is tagged as either a foreign syllable or a pure Korean syllable. Syllable sequences of Korean strings are modelled by a hidden Markov model in which each state represents a character together with a binary mark indicating whether that syllable is part of a transliterated foreign word. The proposed method extracts transliterated foreign words with high recall and precision. Moreover, it performs well even with small training corpora.
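To make the syllable-tagging formulation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each hidden state is a (syllable, tag) pair, that transitions are estimated from bigram counts interpolated with unigram counts, and that decoding uses the Viterbi algorithm. The function names, the interpolation weight `lam`, the probability floor, and the two toy training sentences are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

FOREIGN, KOREAN = "F", "K"   # binary mark attached to every syllable
TAGS = (FOREIGN, KOREAN)
START = ("<s>", KOREAN)      # hypothetical sentence-start state


def train(tagged_sentences):
    """Count unigrams and bigrams over (syllable, tag) states."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in tagged_sentences:
        states = [START] + list(sentence)
        unigrams.update(states)
        bigrams.update(zip(states, states[1:]))
    return unigrams, bigrams


def log_prob(prev, cur, unigrams, bigrams, lam=0.7, floor=1e-6):
    """Interpolated transition log-probability (lam and floor are assumed values)."""
    total = sum(unigrams.values())
    uni = unigrams[cur] / total
    bi = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return math.log(lam * bi + (1 - lam) * uni + floor)


def tag_syllables(syllables, unigrams, bigrams):
    """Viterbi search for the most likely tag sequence of a syllable string."""
    best = {START: (0.0, [])}  # state -> (log score, tag path so far)
    for syl in syllables:
        nxt = {}
        for tag in TAGS:
            cur = (syl, tag)
            score, path = max(
                ((s + log_prob(prev, cur, unigrams, bigrams), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            nxt[cur] = (score, path + [tag])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


def extract_foreign_words(syllables, tags):
    """Join maximal runs of foreign-tagged syllables into words."""
    words, current = [], []
    for syl, tag in zip(syllables, tags):
        if tag == FOREIGN:
            current.append(syl)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Toy training data (hypothetical): "컴퓨터를 샀다" and "인터넷이 빠르다".
TRAINING = [
    list(zip("컴퓨터를샀다", "FFFKKK")),
    list(zip("인터넷이빠르다", "FFFKKKK")),
]

if __name__ == "__main__":
    unigrams, bigrams = train(TRAINING)
    syllables = list("컴퓨터를")
    tags = tag_syllables(syllables, unigrams, bigrams)
    print(tags, extract_foreign_words(syllables, tags))
```

Because each state carries the syllable itself, the only free choice at decoding time is the binary mark, so the search considers two candidate states per position. A realistic system would need a much larger tagged corpus and a more careful smoothing scheme than the fixed interpolation used here.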
