Model of Chinese Words Rough Segmentation Based on N-Shortest-Paths Method
As the very first step of Chinese word segmentation, rough segmentation tries to cover the correct segmentation with as few candidates as possible. This paper presents a rough segmentation model based on the N-shortest-paths method to achieve this goal. In parallel, a statistical model can easily be obtained by attaching frequencies to the edges of the word graph. Experiments were conducted on a one-month news corpus of 185,192 sentences from the People's Daily. Measured by sentence, the recall rate of the non-statistical model based on the 2-shortest-paths method is 99.73%. When the statistical model is applied, a recall rate as high as 99.94% can be reached with 6.12 candidates on average, nearly 6.4% higher than the best known approach and 15% higher than maximum matching segmentation. In addition, the average number of segmentation candidates is reduced by a factor of 64 compared with full segmentation. The results show that the N-shortest-paths method is effective for the task of rough segmentation.
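The idea of ranking candidate segmentations as paths through a word graph can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a small in-memory lexicon, treats every single character as a fallback word, and keeps the n lowest-cost paths into each node, breaking ties arbitrarily. The function names, the `max_word_len` limit, and the add-one smoothed negative log frequency used as the statistical edge cost are all assumptions made for illustration. The abstract only states that frequencies are attached to the edges.

```python
import heapq
import math


def build_word_graph(sentence, lexicon, max_word_len=4):
    """edges[i] lists every end position j where sentence[i:j] is a candidate word.

    Single characters are always allowed, so the last node is always reachable.
    """
    n = len(sentence)
    edges = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            if j == i + 1 or sentence[i:j] in lexicon:
                edges[i].append(j)
    return edges


def frequency_cost(freqs):
    """Hypothetical statistical edge cost: add-one smoothed negative log frequency."""
    total = sum(freqs.values()) + len(freqs) + 1
    return lambda word: -math.log((freqs.get(word, 0) + 1) / total)


def n_shortest_segmentations(sentence, lexicon, n_best=2, cost=None, max_word_len=4):
    """Return up to n_best (total_cost, words) pairs, lowest cost first.

    With cost=None every edge costs 1, so the shortest paths are the
    segmentations with the fewest words (the non-statistical variant).
    """
    if cost is None:
        cost = lambda word: 1.0
    length = len(sentence)
    edges = build_word_graph(sentence, lexicon, max_word_len)
    # best[j] collects candidate paths from node 0 to node j.
    best = [[] for _ in range(length + 1)]
    best[0] = [(0.0, ())]
    for i in range(length):
        # All edges into node i start at earlier nodes, so best[i] is complete here.
        best[i] = heapq.nsmallest(n_best, best[i])
        for j in edges[i]:
            word = sentence[i:j]
            for total, words in best[i]:
                best[j].append((total + cost(word), words + (word,)))
    return heapq.nsmallest(n_best, best[length])


if __name__ == "__main__":
    # Toy lexicon, chosen only to produce an ambiguous word graph.
    toy_lexicon = {"的确", "确实", "实在", "在理"}
    for total, words in n_shortest_segmentations("他说的确实在理", toy_lexicon, n_best=3):
        print(total, " / ".join(words))
```

Keeping only the n best partial paths at each node is sufficient because any prefix of an n-best path must itself be among the n best paths into its end node. Passing `cost=frequency_cost(freqs)` switches the same search from the unit-cost variant to a frequency-weighted one.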