Error feedback based lexical entity extraction for Chinese language modeling

Chinese, unlike Western languages, has no standard definition of a word. Choosing a suitable lexicon therefore plays an important role in Chinese language modeling. This paper proposes a novel method for constructing the lexicon automatically. Rather than depending on statistical measures of text features, the method is driven directly by error feedback from the target task, which in this paper is phoneme-to-grapheme conversion. The whole process consists of two iterative phases: Phase One selects individual words from a large manual lexicon, and Phase Two further extracts compound words based on the result of Phase One. Experiments on phoneme-to-grapheme conversion show that the method achieves absolute reductions in character error rate of 1.09% for Phase One and 0.38% for Phase Two, compared with baseline lexicons of the same size generated by the conventional word-frequency-based method.
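The abstract reports its results as absolute reductions in character error rate (CER) but does not say how CER is computed. The following is a minimal sketch, assuming the common definition: character-level Levenshtein distance between reference and hypothesis, summed over sentences and divided by the total number of reference characters. The function names and the example sentence are illustrative and do not come from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Assumed CER definition: total edit distance / total reference characters."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_errors / total_chars


def absolute_reduction(baseline_cer: float, new_cer: float) -> float:
    """Absolute (not relative) CER reduction, as a plain difference."""
    return baseline_cer - new_cer


# Hypothetical example: one wrong character out of six.
refs = ["中国语言模型"]
hyps = ["中国语言模形"]
cer = character_error_rate(refs, hyps)
```

Under this definition, an absolute reduction of 1.09% means the CER itself drops by 1.09 percentage points, for example from 10.00% to 8.91%, rather than by 1.09% of its previous value.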
