Paper information - Using n-grams for Korean text retrieval

Using n-grams for Korean text retrieval

Conventional word-based indexing is difficult to apply to Korean. The indexable segment of a word, i.e., its stem, is often a compound noun, which seriously degrades retrieval effectiveness. Morpheme-based indexing, which decomposes a compound noun into simple nouns, was developed to overcome this problem, but it requires a large dictionary and complex linguistic knowledge. In this paper we propose a new indexing method that combines word-based indexing with n-gram indexing. The proposed method alleviates the compound-noun problem without dictionaries or linguistic knowledge. Experimental results show that the proposed method may be almost as effective as morpheme-based indexing.
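To make the idea of combining word-based and n-gram indexing concrete, here is a minimal sketch in Python. The abstract does not say how the two term sets are combined or which n is used, so the function names (`char_ngrams`, `index_terms`), the choice of bigrams, and the simple "stem plus its n-grams" union are all assumptions made for illustration, not the authors' method.

```python
import unicodedata


def char_ngrams(word: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of a word.

    The word is normalised to NFC first so that each Hangul syllable
    block is a single code point. Words no longer than n are returned
    whole.
    """
    word = unicodedata.normalize("NFC", word)
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(stem: str, n: int = 2) -> list[str]:
    """Hypothetical combined indexing: the stem itself plus its n-grams.

    This union is an assumed combination scheme; the paper may weight
    or merge the two evidence sources differently.
    """
    stem = unicodedata.normalize("NFC", stem)
    terms = [stem]
    for gram in char_ngrams(stem, n):
        if gram not in terms:  # avoid duplicate index terms
            terms.append(gram)
    return terms


# The compound noun "정보검색" (information retrieval) consists of the
# simple nouns "정보" (information) and "검색" (retrieval).
print(index_terms("정보검색"))
# ['정보검색', '정보', '보검', '검색']
```

In this sketch, a query for the simple noun "검색" can match a document indexed under the compound "정보검색" because the bigram "검색" is also an index term, with no dictionary lookup involved. The spurious bigram "보검" shows the kind of noise such a scheme introduces.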

Jeong Soo Ahn | Joon Ho Lee
