Paper information - New Word Discovery Algorithm Based on N-Gram for Multi-word Internal Solidification Degree and Frequency

Lexicons built by traditional new word discovery methods have low usability, and methods that consider only the solidification of two adjacent characters produce erroneous new word segmentations. To address both problems, a new word discovery algorithm based on the N-Gram internal solidification degree and frequency of multiple characters is proposed. First, a fixed, relatively large n is selected, the internal solidification degree is calculated for the different values of n, and only the fragments whose degree exceeds a certain threshold are kept, forming a set Q. Next, named entity recognition is combined with N-Gram segmentation to construct a segmented corpus, and fragment frequencies are counted. A backtracking mechanism is then introduced: candidate words are filtered by word frequency and K times the mutual information, a Trie structure is used to speed up string retrieval, and meaningless words are filtered along multiple dimensions using a rule-based stop word list and a Chinese stop word list. Finally, the candidates are compared against a dictionary, and the words that do not appear in it form the new word set. Experiments show that the algorithm effectively improves the accuracy of new word recognition and is significantly faster than other new word discovery algorithms.
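The first step of the abstract (scoring N-Gram fragments and keeping those above a threshold) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formula, so this sketch assumes a common formulation: the internal solidification degree of a fragment is the minimum pointwise mutual information over all of its two-part splits. The function names, the probability estimate (count divided by text length), and the default values of `max_n`, `min_freq`, and `threshold` are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import log


def count_ngrams(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def solidification(fragment, counts, total):
    """Assumed definition: minimum PMI over all two-part splits of fragment.

    Probabilities are approximated as count / total, where total is the
    number of characters in the corpus.
    """
    p_whole = counts[fragment] / total
    scores = []
    for k in range(1, len(fragment)):
        p_left = counts[fragment[:k]] / total
        p_right = counts[fragment[k:]] / total
        scores.append(log(p_whole / (p_left * p_right)))
    return min(scores)


def candidate_set(text, max_n=4, min_freq=2, threshold=1.0):
    """Keep multi-character fragments that are both frequent and solid enough."""
    counts = count_ngrams(text, max_n)
    total = len(text)
    return {
        frag
        for frag, freq in counts.items()
        if len(frag) >= 2
        and freq >= min_freq
        and solidification(frag, counts, total) >= threshold
    }
```

Taking the minimum over every split point, rather than scoring only one adjacent pair, is one way to reflect the abstract's point that two-character solidification alone is not enough. The later steps (named entity recognition, backtracking, Trie lookup, stop word filtering) are not covered by this sketch.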

Xudong Li | Xiangyang Chen
