Paper information - Overlapping statistical segmentation for effective indexing of Japanese text

Because word boundaries are not explicitly marked in Asian languages such as Japanese, word indexing cannot be applied directly. Dictionary-based segmentation enables word indexing but introduces problems of its own, such as dictionary maintenance. N-gram indexing, another commonly used method, suffers from low retrieval performance and increased index size. This paper proposes a new statistical indexing method. It uses a new measure, computed from character statistics, to evaluate the likelihood that a bi-gram spans a word boundary, and a new segmentation strategy that extracts some overlapping segments in addition to the segments extracted by the existing strategy. As a result, our method achieves higher retrieval effectiveness.
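
The abstract does not give the formula for the bi-gram measure or the rule for choosing overlapping segments, so the following Python sketch is only an illustration of the general idea under assumed definitions. The score `p_end(a) * p_start(b)`, the thresholds `high` and `low`, the `unseen` default, and all function names are hypothetical and are not taken from the paper. Here a bi-gram scoring at least `high` is treated as a boundary, and a bi-gram scoring between `low` and `high` inside a segment yields two extra sub-segments that overlap the base segment.

```python
from collections import Counter


def train_char_stats(segmented_corpus):
    """Estimate, per character, how often it starts or ends a word.

    segmented_corpus: iterable of sentences, each a list of non-empty words.
    (Assumption: a small pre-segmented sample is available for counting.)
    """
    total, starts, ends = Counter(), Counter(), Counter()
    for sentence in segmented_corpus:
        for word in sentence:
            if not word:
                continue
            total.update(word)          # count every character occurrence
            starts[word[0]] += 1
            ends[word[-1]] += 1
    p_start = {c: starts[c] / n for c, n in total.items()}
    p_end = {c: ends[c] / n for c, n in total.items()}
    return p_start, p_end


def boundary_score(a, b, p_start, p_end, unseen=0.5):
    """Hypothetical measure: chance that a word ends at `a` and starts at `b`."""
    return p_end.get(a, unseen) * p_start.get(b, unseen)


def segment(text, p_start, p_end, high=0.6, low=0.3):
    """Return (base_segments, overlapping_segments) for `text`."""
    if not text:
        return [], []
    scores = [
        boundary_score(text[i], text[i + 1], p_start, p_end)
        for i in range(len(text) - 1)
    ]
    # Base strategy: cut wherever the bi-gram score reaches `high`.
    cuts = [0] + [i + 1 for i, s in enumerate(scores) if s >= high] + [len(text)]
    base = [text[a:b] for a, b in zip(cuts, cuts[1:])]
    # Overlapping strategy (assumed): an uncertain bi-gram inside a base
    # segment also contributes the two sub-segments on either side of it.
    extra = []
    for a, b in zip(cuts, cuts[1:]):
        for i in range(a, b - 1):
            if low <= scores[i] < high:
                extra.extend([text[a:i + 1], text[i + 1:b]])
    return base, extra


corpus = [["東京", "大学"], ["京都", "大学"], ["東京", "都"]]
p_start, p_end = train_char_stats(corpus)
base, extra = segment("東京都大学", p_start, p_end)
print(base, extra)
```

With this toy corpus, the base strategy keeps 東京都 as a single index term, while the uncertain bi-gram 京都 adds the overlapping terms 東京 and 都, so a query for 東京 can still match the indexed text. Indexing both sets trades a somewhat larger index for fewer missed matches.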

Toru Matsuda | Yasushi Ogawa