A comparison of morpheme and word based document retrieval for Asian languages

Most document retrieval systems are word-based. Words are convenient retrieval units in English, but less so in some Asian languages. Determining which morphemes constitute words in Vietnamese and Chinese is problematic, and this difficulty has been assumed to be the reason that word-based retrieval works less well for these languages. This paper examines a number of segmentation algorithms and then reports on experiments comparing morpheme-based and word-based retrieval. It shows that morpheme-based retrieval is hard to improve on.
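
To make the distinction between the two kinds of retrieval unit concrete, the following is a minimal sketch, not the paper's method. The abstract does not name the segmentation algorithms it examines; greedy forward maximum matching against a lexicon is used here only as one familiar dictionary-based approach. The function names, the toy lexicon, and the `max_len` parameter are all assumptions made for illustration, and the input is assumed to be Vietnamese text in which morphemes (syllables) are separated by spaces.

```python
def morpheme_units(text):
    """Morpheme-based indexing: each space-separated syllable is a unit."""
    return text.lower().split()


def word_units(text, lexicon, max_len=4):
    """Word-based indexing via greedy forward maximum matching.

    At each position, take the longest run of syllables (up to max_len)
    that appears in the lexicon; fall back to a single syllable.
    """
    syllables = text.lower().split()
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy lexicon of multi-syllable words.
LEXICON = {"sinh viên", "tiếng việt"}
SENTENCE = "sinh viên học tiếng việt"

print(morpheme_units(SENTENCE))
print(word_units(SENTENCE, LEXICON))
```

With this toy lexicon, the morpheme-based view yields five units, while the word-based view groups them into three. The word-based result depends entirely on the lexicon and the matching strategy, which is where the segmentation difficulty described above enters.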