Automatic new word extraction method
New words are difficult to extract automatically in languages whose written text has no word boundaries, such as Chinese and Japanese. In this paper, we present a statistical method for extracting new words from a large corpus without word boundaries. Based on the Generalized Suffix Tree (GST) data structure, we define the New Word Pattern (NWP) and the Segmentation Boundary Pattern (SBP) to split input strings into small pieces, and offer a practical and efficient algorithm for obtaining the proper words from the GST.
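As a rough illustration of the statistics such a method relies on, the sketch below counts, for every short substring of an unsegmented corpus, how often it occurs and in how many distinct input strings it appears. This is not the paper's algorithm: a plain dictionary stands in for the GST, the NWP and SBP definitions are not given in the abstract and are not modelled, and the names `substring_stats`, `candidate_words`, `known_words`, and the thresholds are hypothetical.

```python
from collections import Counter


def substring_stats(docs, max_len=4):
    """Count substrings of length 1..max_len across unsegmented strings.

    Returns (total, doc_count): total occurrences of each substring, and
    the number of distinct input strings that contain it. A generalized
    suffix tree can provide the same counts more efficiently; this naive
    enumeration is only for illustration.
    """
    total = Counter()
    doc_count = Counter()
    for doc in docs:
        seen = set()
        for i in range(len(doc)):
            for n in range(1, max_len + 1):
                if i + n > len(doc):
                    break
                s = doc[i:i + n]
                total[s] += 1
                seen.add(s)
        # Each substring contributes at most once per input string.
        doc_count.update(seen)
    return total, doc_count


def candidate_words(docs, known_words, min_len=2, max_len=4, min_docs=2):
    """Return substrings found in at least min_docs strings and not in
    the known vocabulary (an assumed stand-in for a dictionary lookup)."""
    _, doc_count = substring_stats(docs, max_len)
    return sorted(
        s for s, d in doc_count.items()
        if len(s) >= min_len and d >= min_docs and s not in known_words
    )
```

A frequency threshold alone also returns fragments of longer words (for example both "机器学" and "机器学习" when the latter recurs), which suggests why additional boundary criteria are needed to select proper words.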