Overlapping statistical segmentation for effective indexing of Japanese text

Because word boundaries are not explicitly marked in Asian languages such as Japanese, word indexing cannot be applied to them directly. Dictionary-based segmentation makes word indexing possible, but it introduces problems of its own, such as dictionary maintenance. N-gram indexing, another commonly used method, suffers from low retrieval performance and increased index size. This paper proposes a new statistical indexing method. It combines a new measure, computed from character statistics, that evaluates how likely a bi-gram is to be a word boundary, with a new segmentation strategy that extracts overlapping segments in addition to the segments extracted by the existing strategy. As a result, our method achieves higher retrieval effectiveness.
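
The two ideas named in the abstract, a character-statistics boundary measure and overlapping segment extraction, can be illustrated with a minimal sketch. The abstract does not define the measure or the extraction rule, so everything below is an assumption made for illustration: the score `1 - count(ab) / min(count(a), count(b))`, the thresholds `low` and `high`, and the function names `train_counts`, `boundary_score`, and `segment` are hypothetical stand-ins, not the paper's method. The toy corpus uses Latin letters only to keep the arithmetic easy to check.

```python
from collections import Counter


def train_counts(corpus):
    """Collect character counts and adjacent-character-pair counts from raw text."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def boundary_score(a, b, unigrams, bigrams):
    """Hypothetical stand-in measure (NOT the paper's): close to 1.0 when
    characters a and b rarely appear next to each other."""
    denom = min(unigrams[a], unigrams[b])
    if denom == 0:
        return 1.0  # unseen character: assume a boundary
    return 1.0 - bigrams[(a, b)] / denom


def segment(text, unigrams, bigrams, low=0.5, high=0.8):
    """Return (primary, overlapping) segments.

    Assumed rule: cut wherever the score is >= low. Where a cut is
    uncertain (score < high), also emit the segment spanning that cut.
    """
    if not text:
        return [], []
    scores = [boundary_score(a, b, unigrams, bigrams)
              for a, b in zip(text, text[1:])]
    cuts = [i + 1 for i, s in enumerate(scores) if s >= low]
    bounds = [0] + cuts + [len(text)]
    primary = [text[s:e] for s, e in zip(bounds, bounds[1:])]
    overlapping = []
    for k, cut in enumerate(cuts):
        if scores[cut - 1] < high:
            # Join the segments on both sides of the uncertain cut.
            overlapping.append(text[bounds[k]:bounds[k + 2]])
    return primary, overlapping


unigrams, bigrams = train_counts(["abab", "cdcd", "abcd"])
print(segment("abcd", unigrams, bigrams))  # (['ab', 'cd'], ['abcd'])
```

In this toy run the pair `b`/`c` scores about 0.67, which is high enough to cut but below `high`, so the index would receive both the short segments `ab` and `cd` and the overlapping segment `abcd`. Indexing both is one plausible way to hedge against an uncertain boundary, at the cost of a somewhat larger index.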