Chinese Word Segmentation Based on the Marginal Probabilities Generated by CRFs

Treating word segmentation as a sequence tagging problem and solving it with conditional random fields (CRFs) has recently become a widely used approach. However, CRFs still produce some wrong tags. To reduce the number of wrong tags, we propose a new method for Chinese word segmentation based on the marginal probabilities generated by CRFs. First, candidate words with high marginal probabilities are extracted from the tagging results. Then, candidate words with low marginal probabilities in the tagging results are recombined. Finally, a premium mechanism built on FMM is introduced to complement the sub-strings produced by the recombination procedure. Evaluated on the closed track of the SXU and NCC corpora in the fourth SIGHAN Chinese Word Segmentation Bakeoff, this method achieves F-scores of 96.41% and 94.30%, respectively.
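
The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything beyond the abstract is an assumption made for illustration: the BMES tag set, scoring a word by the lowest marginal probability among its characters, the threshold of 0.9, reading FMM as forward maximum matching, and the hypothetical names `words_with_confidence`, `fmm`, and `resegment`. The premium mechanism is not specified in the abstract and is not modeled here; plain dictionary matching stands in for it.

```python
from typing import List, Set, Tuple

# (character, BMES tag, marginal probability of that tag); an assumed input format
Tagged = Tuple[str, str, float]


def words_with_confidence(tagged: List[Tagged]) -> List[Tuple[str, float]]:
    """Group characters into words by BMES tags and score each word by the
    lowest marginal probability among its characters (assumed scoring rule)."""
    words: List[Tuple[str, float]] = []
    chars, conf = "", 1.0
    for ch, tag, prob in tagged:
        chars += ch
        conf = min(conf, prob)
        if tag in ("E", "S"):  # a word ends at an E or S tag
            words.append((chars, conf))
            chars, conf = "", 1.0
    if chars:  # keep an unterminated trailing word
        words.append((chars, conf))
    return words


def fmm(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest lexicon
    entry of at most max_len characters, falling back to one character."""
    out: List[str] = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                out.append(piece)
                i += n
                break
    return out


def resegment(tagged: List[Tagged], lexicon: Set[str],
              threshold: float = 0.9, max_len: int = 4) -> List[str]:
    """Keep high-confidence words, merge adjacent low-confidence words into
    one sub-string, and re-segment that sub-string with FMM."""
    result: List[str] = []
    pending = ""  # adjacent low-confidence words, recombined
    for word, conf in words_with_confidence(tagged):
        if conf >= threshold:
            if pending:
                result.extend(fmm(pending, lexicon, max_len))
                pending = ""
            result.append(word)
        else:
            pending += word
    if pending:
        result.extend(fmm(pending, lexicon, max_len))
    return result


# Made-up tags and probabilities, for illustration only.
example: List[Tagged] = [
    ("我", "S", 0.99), ("喜", "B", 0.98), ("欢", "E", 0.97),
    ("北", "B", 0.60), ("京", "E", 0.55), ("大", "B", 0.50), ("学", "E", 0.60),
]
print(resegment(example, {"北京", "大学", "北京大学"}))
```

In this made-up example the two low-confidence words are merged into one sub-string, which the dictionary lookup then treats as a single entry, while the high-confidence words pass through unchanged.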