MODELS AND ALGORITHM FOR ASSIGNING WORD BREAKS TO CHINESE TEXT

This paper describes two models for assigning word breaks to Chinese text: the word form model (WFM), based on the word-formation power of Chinese character strings, and the character juncture model (CJM), based on the affinity of Chinese character pairs inside or outside words. The two models are combined by linear interpolation to assign word breaks to Chinese text. The search space is analyzed and the corresponding search algorithm is given. Unlike general statistical models, the parameters of the proposed models can be trained directly from a raw corpus, which makes the approach highly adaptable. Experiments show that the approach is both reliable and efficient.
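
As an illustration only, the interpolation and search described above might be sketched as follows. The paper does not give its formulas here, so everything in this sketch is an assumption: the log-domain scores, the toy tables TOY_WFM and TOY_CJM, the interpolation weight lam, the maximum word length max_len, and the use of a dynamic-programming (Viterbi-style) search are all hypothetical stand-ins, not the authors' actual models or algorithm.

```python
import math

# Hypothetical toy parameters (NOT from the paper): a log-score per candidate
# word for the word form model, and a break probability per adjacent
# character pair for the character juncture model.
TOY_WFM = {"我": -1.0, "爱": -1.0, "北京": -1.0}
TOY_CJM = {("我", "爱"): 0.9, ("爱", "北"): 0.9, ("北", "京"): 0.1}


def word_score(word):
    """Assumed WFM score: log-score of a string forming a word."""
    return TOY_WFM.get(word, -10.0)


def break_prob(left, right):
    """Assumed CJM score: probability of a word break between two characters."""
    return TOY_CJM.get((left, right), 0.5)


def segment(text, lam=0.5, max_len=4):
    """Return the best-scoring segmentation under the interpolated score."""
    n = len(text)

    def juncture_score(i, j):
        # Log-score of text[i:j] being one word: no break at its internal
        # junctures, and a break at its right edge unless the text ends there.
        s = sum(math.log(1.0 - break_prob(text[k], text[k + 1]))
                for k in range(i, j - 1))
        if j < n:
            s += math.log(break_prob(text[j - 1], text[j]))
        return s

    best = [-math.inf] * (n + 1)  # best[j]: best score of a segmentation of text[:j]
    back = [0] * (n + 1)          # back[j]: start index of the last word in that segmentation
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            score = (best[i]
                     + lam * word_score(text[i:j])
                     + (1.0 - lam) * juncture_score(i, j))
            if score > best[j]:
                best[j], back[j] = score, i

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    j = n
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]


print(segment("我爱北京"))  # ['我', '爱', '北京']
```

With these toy values, the segmentation that keeps 北京 together scores highest because both assumed models favor it. Real parameters would be estimated from a raw corpus, as the abstract states.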