Statistics-based segment pattern lexicon: a new direction for Chinese language modeling

This paper presents a new direction for Chinese language modeling based on a different concept of the lexicon. Every Chinese character carries its own meaning, Chinese sentences contain no blanks to serve as word boundaries, and word formation in Chinese is extremely flexible. As a result, "words" in Chinese are not well defined, and no commonly accepted lexicon exists. This makes language modeling for Chinese very difficult and the out-of-vocabulary (OOV) problem especially serious. A new concept of the lexicon is therefore proposed. Its elements can be words or any other "segment patterns", extracted from the training corpus by statistical approaches with the goal of minimizing the overall perplexity. Language models can then be developed on the basis of this new lexicon. Very encouraging experimental results have been obtained.
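The idea of choosing lexicon entries by their effect on perplexity can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a unigram model over segments, greedy longest-match segmentation, and perplexity measured per character on the training corpus itself. The names `segment`, `char_perplexity`, and `grow_lexicon` are hypothetical.

```python
import math
from collections import Counter


def segment(sentence, lexicon, max_len=4):
    """Greedy longest-match segmentation; single characters are always allowed."""
    out, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if n == 1 or piece in lexicon:
                out.append(piece)
                i += n
                break
    return out


def char_perplexity(corpus, lexicon, max_len=4):
    """Per-character perplexity of a maximum-likelihood unigram model over segments.

    Assumption: the model is estimated and evaluated on the same corpus.
    """
    counts = Counter(s for sent in corpus for s in segment(sent, lexicon, max_len))
    total = sum(counts.values())
    log_prob = sum(c * math.log(c / total) for c in counts.values())
    n_chars = sum(len(sent) for sent in corpus)
    return math.exp(-log_prob / n_chars)


def grow_lexicon(corpus, max_len=4, rounds=10):
    """Greedily add one segment pattern per round if it lowers the perplexity."""
    lexicon = set()
    best = char_perplexity(corpus, lexicon, max_len)
    # Candidate patterns: every substring of 2..max_len characters, with its frequency.
    candidates = Counter(
        sent[i:i + n]
        for sent in corpus
        for n in range(2, max_len + 1)
        for i in range(len(sent) - n + 1)
    )
    for _ in range(rounds):
        improved = False
        for pattern, _freq in candidates.most_common():
            if pattern in lexicon:
                continue
            pp = char_perplexity(corpus, lexicon | {pattern}, max_len)
            if pp < best:
                lexicon.add(pattern)
                best, improved = pp, True
                break
        if not improved:
            break
    return lexicon, best


corpus = ["電腦很好", "電腦很快"]  # toy data, invented for illustration
baseline = char_perplexity(corpus, set())
lexicon, perplexity = grow_lexicon(corpus)
print(baseline, perplexity, sorted(lexicon))
```

Because this sketch scores patterns on the training corpus, it keeps absorbing longer patterns until whole sentences become lexicon entries. A practical version would presumably need held-out data, a frequency threshold, or a similar constraint.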
