Lexicon adaptation with reduced character error (LARCE) - a new direction in Chinese language modeling

Good language modeling relies on good predefined lexicons. For Chinese, written text has no word boundaries and the concept of a “word” is not well defined, so constructing good lexicons is difficult. In this paper, we propose lexicon adaptation with reduced character error (LARCE), which learns new word tokens under the criterion of reducing the error rate on the adaptation corpus. In this approach, a multi-character string is taken as a new “word” as long as it helps reduce the error rate, so that a minimal number of new, high-quality words is obtained. The algorithm is based on character-based consensus networks. Initial experiments on Chinese broadcast news show that LARCE not only significantly outperforms PAT-tree-based word extraction algorithms, but even outperforms manually augmented lexicons. We believe the concept is equally useful for other character-based languages.
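
The acceptance criterion described above can be illustrated with a minimal sketch. It is not the LARCE algorithm itself: the abstract does not give the procedure, and the character-based consensus networks it relies on are not modeled here. The function names, the greedy one-candidate-at-a-time loop, and the `toy_recognize` stand-in for a speech recognizer are all assumptions made for illustration. The sketch only shows the idea of keeping a candidate multi-character string when adding it lowers the character error rate on an adaptation corpus.

```python
from typing import Callable, Iterable, List, Set, Tuple


def char_errors(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_cer(refs: List[str], hyps: List[str]) -> float:
    """Character error rate over a whole adaptation corpus."""
    errors = sum(char_errors(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def adapt_lexicon(base: Iterable[str],
                  candidates: Iterable[str],
                  refs: List[str],
                  recognize: Callable[[Set[str]], List[str]]) -> Tuple[Set[str], float]:
    """Greedily keep each candidate only if it lowers the corpus CER (assumed strategy)."""
    lexicon = set(base)
    best = corpus_cer(refs, recognize(lexicon))
    for cand in candidates:
        trial = lexicon | {cand}
        cer = corpus_cer(refs, recognize(trial))
        if cer < best:  # accept only when the error rate actually drops
            lexicon, best = trial, cer
    return lexicon, best


# Hypothetical adaptation corpus and recognizer, for illustration only.
refs = ["台北市長", "市長說"]


def toy_recognize(lexicon: Set[str]) -> List[str]:
    # Toy behavior: "市長" is misrecognized as "市場" unless it is in the lexicon.
    return [r if "市長" in lexicon else r.replace("市長", "市場") for r in refs]


lexicon, cer = adapt_lexicon({"台北", "說"}, ["北市", "市長"], refs, toy_recognize)
```

In this toy setup the candidate "北市" leaves the error rate unchanged and is rejected, while "市長" removes both errors and is kept. A real system would replace `toy_recognize` with actual decoding or rescoring, which is where the cost of evaluating each candidate lies.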
