Paper Information - Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

Katakana, the Japanese phonographic script used mainly for loan words, is a troublemaker in Japanese word segmentation. Because Katakana words are heavily domain-dependent and Katakana neologisms are numerous, it is almost impossible to construct and maintain a Katakana word dictionary by hand. This paper proposes an automatic segmentation method for Japanese Katakana compounds, which makes it possible to construct a precise and concise Katakana word dictionary automatically, given only a medium-sized or large Japanese corpus from some domain.

Daisuke Kawahara | Sadao Kurohashi | Toshiaki Nakazawa

[1] Hiroshi Nakagawa, et al. Automatic Construction of Japanese KATAKANA Variant List from Large Corpus, 2004, COLING.

[2] Hiroshi Nakagawa, et al. Information Retrieval Based on Combination of Japanese Compound Words Matching and Co-occurrence Based Retrieval, 1998.

[3] Hiroshi Nakagawa, et al. Automatic Term Recognition by the Relation between Compound Nouns and Basic Nouns, 2000.

[4] Kevin Knight, et al. Machine Transliteration, 1997, CL.

[5] Yuji Matsumoto, et al. Japanese Morphological Analysis System ChaSen version 2.0 Manual, 1999.

[6] Hiroshi Nakagawa, et al. Term Extraction Based on Occurrence and Concatenation Frequency, 2003.

[7] Satoshi Sato, et al. Integrating Cross-Lingually Relevant News Articles and Monolingual Web Documents in Bilingual Lexicon Acquisition, 2004, COLING.