Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data

Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without making use of any lexicon and hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. The preliminary experiment shows that the segmentation accuracy of our algorithm is acceptable. We hope the gaining of this approach will be beneficial to improving the performance (especially in ability to cope with unknown words and ability to adapt to various domains) of the existing segmenters, though the algorithm itself can also be utilized as a stand-alone segmenter in some NLP applications.