POC-NLW Template Based Tagging Method for Chinese Word Segmentation

In Chinese word segmentation, disambiguation and unknown word identification are the two key issues. In this paper, a two-stage system is constructed to deal with these problems. In the first stage, an n-gram based model performs the basic segmentation and resolves ambiguity to some extent. In the second stage, a language tagging template named POC-NLW is adopted to carry out a character sequence tagging procedure based on a hidden Markov model, which refines the results of the first stage and identifies unknown words. Several detailed experiments were conducted on the SIGHAN Bakeoff 2005 corpus. The experimental results show that the method achieves high accuracy in both word segmentation and unknown word identification, with appreciable processing efficiency. The method is characterized by good interoperability and extensibility across different kinds of unknown words, and is therefore applicable to practical Chinese information processing applications.
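To make the second stage more concrete, the following is a minimal sketch of character sequence tagging with a hidden Markov model decoded by the Viterbi algorithm. It is an illustration under stated assumptions, not the system described here: the abstract does not define the POC-NLW tag set, so the sketch assumes the common position-of-character tags B, M, E and S (begin, middle, end, single), and all probabilities, the `floor` smoothing value, and the function names `viterbi` and `tags_to_words` are hypothetical.

```python
import math

# Assumed position-of-character tags; the real POC-NLW tag set may differ.
TAGS = ("B", "M", "E", "S")

# Hand-set toy parameters, for illustration only (not estimated from a corpus).
START_P = {"B": 0.6, "S": 0.4}
TRANS_P = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.6, "S": 0.4},
    "S": {"B": 0.6, "S": 0.4},
}
EMIT_P = {
    "B": {"北": 0.9},
    "M": {},
    "E": {"京": 0.9},
    "S": {"我": 0.5, "爱": 0.5},
}


def viterbi(chars, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable tag sequence for a character sequence."""
    if not chars:
        return []

    def logp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(max(table.get(key, 0.0), floor))

    scores = {t: logp(start_p, t) + logp(emit_p[t], chars[0]) for t in TAGS}
    back = []  # back[i][t] = best predecessor of tag t at position i + 1
    for ch in chars[1:]:
        new_scores, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: scores[p] + logp(trans_p[p], t))
            new_scores[t] = (
                scores[prev] + logp(trans_p[prev], t) + logp(emit_p[t], ch)
            )
            pointers[t] = prev
        scores = new_scores
        back.append(pointers)

    # A well-formed segmentation must end on a word-final tag.
    last = max(("E", "S"), key=lambda t: scores[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(chars, tags):
    """Group characters into words, closing a word at each E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep any trailing characters of an unfinished word
        words.append(current)
    return words


tags = viterbi("我爱北京", START_P, TRANS_P, EMIT_P)
words = tags_to_words("我爱北京", tags)
```

In a two-stage setup of the kind described, such a tagger would be run over the output of the first-stage segmenter, so that character runs left unresolved by the n-gram model can be regrouped into candidate unknown words.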