The Role of High Frequent Maximal Crossing Ambiguities in Chinese Word Segmentation

Resolving crossing ambiguities remains an open issue in Chinese word segmentation. In this paper, we first introduce the concept of maximal crossing ambiguity and then divide it into two major types, the true and the pseudo. Based on observations of a Chinese corpus of 100M characters, we find that high-frequency maximal crossing ambiguities have strong coverage (the top 4,619 cover as much as 59.20% of occurrences, and 4,279 of these belong to the pseudo type, covering 53.35%) and are rather stable under domain shift. Consequently, we propose a memory-based strategy for high-frequency maximal crossing ambiguities that is expected to significantly improve the performance of practical Chinese word segmenters.
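As a rough illustration of what a memory-based strategy could look like, the sketch below stores a fixed segmentation for each memorized ambiguous string and consults that table before handing the remaining text to an ordinary segmenter. It is a minimal sketch under stated assumptions, not the procedure used in this paper: the names `MEMORY` and `segment_with_memory`, the single table entry, and the per-character fallback are all hypothetical, and a stored answer of this kind would only be safe for pseudo-type ambiguities, whose segmentation is assumed here not to depend on context.

```python
from typing import Callable, Dict, List

# Hypothetical memory table: ambiguous string -> its stored segmentation.
# The entry is illustrative only and is not taken from the paper's list.
MEMORY: Dict[str, List[str]] = {
    "为人民": ["为", "人民"],
}


def segment_with_memory(
    text: str,
    memory: Dict[str, List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Segment text, preferring memorized segmentations over the fallback."""
    max_len = max((len(key) for key in memory), default=0)
    words: List[str] = []
    pending: List[str] = []  # characters not covered by any memorized string
    i = 0
    while i < len(text):
        # Try the longest memorized string starting at position i.
        match = None
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in memory:
                match = candidate
                break
        if match is None:
            pending.append(text[i])
            i += 1
            continue
        # Flush the unmatched stretch through the fallback segmenter first.
        if pending:
            words.extend(fallback("".join(pending)))
            pending.clear()
        words.extend(memory[match])
        i += len(match)
    if pending:
        words.extend(fallback("".join(pending)))
    return words


# Placeholder fallback (an assumption): split into single characters.
result = segment_with_memory("他为人民服务", MEMORY, list)
```

In a practical segmenter the fallback would be the system's usual segmentation routine, and true-type ambiguities would still need context-sensitive handling rather than a fixed table entry.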