Paper Information - Automatic new word extraction method

Automatic new word extraction method

Automatically extracting new words is very difficult for languages whose written text has no word boundaries, such as Chinese and Japanese. In this paper, we present a statistical method for extracting new words from a large corpus without word boundaries. Based on the Generalized Suffix Tree (GST) data structure, we define the NWP (New Word Pattern) and SBP (Segmentation Boundary Pattern) to separate input strings into small pieces, and we offer a practical and efficient algorithm for obtaining the proper words from the GST.
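The general idea of frequency-based candidate extraction from unsegmented text can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define NWP or SBP, so they are not modeled here, and a plain substring counter stands in for the Generalized Suffix Tree. The function names `count_substrings` and `candidate_words` and the parameters `min_count` and `max_len` are hypothetical, chosen only for illustration. The sketch keeps repeated substrings that are not contained in a longer substring with the same count, which is roughly the information a suffix tree exposes through its internal nodes.

```python
from collections import Counter


def count_substrings(sentences, max_len=4):
    """Count every substring of length 2..max_len (a stand-in for a GST)."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s)):
            for n in range(2, max_len + 1):
                if i + n <= len(s):
                    counts[s[i:i + n]] += 1
    return counts


def candidate_words(sentences, min_count=2, max_len=4):
    """Return (substring, count) pairs that repeat and are not subsumed.

    Assumption for illustration: a substring is dropped when a longer
    frequent substring contains it and occurs exactly as often, since the
    shorter one then never appears on its own.
    """
    counts = count_substrings(sentences, max_len)
    frequent = {w: c for w, c in counts.items() if c >= min_count}
    result = []
    for w, c in frequent.items():
        subsumed = any(
            len(v) > len(w) and w in v and cv == c
            for v, cv in frequent.items()
        )
        if not subsumed:
            result.append((w, c))
    # Most frequent first, ties broken alphabetically.
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(candidate_words(["abcxy", "zabcq", "mabc"]))
```

In the toy input, `abc` occurs three times in different contexts, while `ab` and `bc` never occur outside `abc`, so only `abc` is kept as a candidate. A real system would replace the quadratic counting with a suffix tree and add boundary criteria such as the paper's SBP.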

Qin Shi | Li Qin Shen | Haixin Chai
