New Word Discovery Algorithm Based on N-Gram Multi-Character Internal Solidification Degree and Frequency

Traditional new word discovery methods produce lexicons of low usability, and methods that consider only the solidification of two adjacent characters yield erroneous new word segmentation results. To address these problems, a new word discovery algorithm based on the N-Gram internal solidification degree and frequency of multiple characters is proposed. First, a fixed upper bound n on the N-Gram order is chosen, the internal solidification degree is calculated for each value of n, and only the fragments whose solidification degree exceeds a given threshold are kept to form a set Q. Then, named entity recognition is combined with N-Gram segmentation to construct the segmented corpus, and fragment frequencies are counted. A backtracking mechanism is introduced, candidate words are filtered by word frequency and by K times the mutual information, a Trie structure is used to speed up string retrieval, and meaningless words are filtered along multiple dimensions using a rule-based stop-word list and a Chinese stop-word list. Finally, the candidates are compared against a dictionary, and the words that do not appear in it form the new word set. Experiments show that the algorithm effectively improves the accuracy of new word recognition and is significantly faster than other new word discovery algorithms.
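The first and last steps described above (scoring fragments by internal solidification degree and frequency to form the set Q, then comparing against a dictionary) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the solidification formula, so the sketch assumes a common definition: the minimum, over all binary split points of a fragment, of the pointwise mutual information between the left and right parts. The function names, the thresholds `min_freq` and `min_solid`, and the use of the text length as the probability normalizer are all illustrative assumptions. Named entity recognition, the backtracking mechanism, the K-times mutual information filter, the Trie, and the stop-word lists are omitted.

```python
from collections import Counter
from math import log


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def solidification(gram, counts, total):
    """Assumed definition: minimum over all binary splits of
    log(P(gram) / (P(left) * P(right))), with P estimated as count / total."""
    p_gram = counts[gram] / total
    scores = []
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        scores.append(log(p_gram / (p_left * p_right)))
    return min(scores)


def candidate_set(text, max_n=3, min_freq=2, min_solid=1.0):
    """Build Q: multi-character fragments that pass both the frequency
    threshold and the solidification threshold (both hypothetical values)."""
    counts = ngram_counts(text, max_n)
    total = len(text)  # simplifying assumption: one normalizer for all orders
    q = {}
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue
        score = solidification(gram, counts, total)
        if score >= min_solid:
            q[gram] = (freq, score)
    return q


def new_words(q, dictionary):
    """Keep only the candidates that do not already appear in the dictionary."""
    return sorted(w for w in q if w not in dictionary)


if __name__ == "__main__":
    toy_text = "abxabyabzabw"  # toy corpus in which "ab" always co-occurs
    q = candidate_set(toy_text)
    print(new_words(q, dictionary=set()))
```

Taking the minimum over all split points, rather than scoring only one adjacent pair, is what lets a score reflect the cohesion of a whole multi-character fragment: a fragment is rejected if any one of its internal boundaries is weakly bound.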