Using n-grams for Korean text retrieval

Conventional word-based indexing is difficult to apply to Korean. The indexable segment of a word, i.e. the stem, is often a compound noun, which seriously degrades retrieval effectiveness. Morpheme-based indexing, which decomposes a compound noun into simple nouns, was developed to overcome this problem; however, it requires a large dictionary and complex linguistic knowledge. In this paper we propose a new indexing method that combines word-based indexing with n-gram indexing. The proposed method alleviates the compound-noun problem without dictionaries or linguistic knowledge. Experimental results show that the proposed method might be almost as effective as morpheme-based indexing.
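
As a rough illustration of how word-based and n-gram indexing might be combined, the sketch below emits each token as an index term together with its overlapping character bigrams. This is not the procedure defined in the paper: the function names `char_ngrams` and `index_terms`, the choice of n = 2, and the assumption that the input tokens are already extracted stems are all hypothetical and chosen only for illustration.

```python
import unicodedata


def char_ngrams(token, n=2):
    """Return the overlapping character n-grams of a token.

    A token no longer than n is returned unchanged as its only n-gram.
    """
    # Normalize to NFC so each Hangul syllable is a single code point.
    token = unicodedata.normalize("NFC", token)
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def index_terms(tokens, n=2):
    """Combine word-level terms with their character n-grams.

    Assumption: `tokens` are stems that have already been extracted
    from the text; stem extraction itself is not shown here.
    """
    terms = []
    for token in tokens:
        token = unicodedata.normalize("NFC", token)
        terms.append(token)  # word-based term
        for gram in char_ngrams(token, n):
            if gram != token:  # avoid duplicating very short tokens
                terms.append(gram)  # n-gram terms
    return terms


# The compound noun "정보검색" (information retrieval) also yields the
# simple nouns "정보" and "검색" as bigrams, with no dictionary lookup.
print(index_terms(["정보검색"]))
```

Under these assumptions, a query for the simple noun "검색" shares an index term with a document that contains only the compound "정보검색". The bigram "보검", which spans the boundary between the two nouns, is also indexed, so a scheme of this kind trades some noise for independence from dictionaries.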