Research on knowledge elements in exponential language model

This paper presents an exponential language model (ELM) for modeling and managing knowledge elements. The model is trained with the Minimum Sample Risk (MSR) algorithm, a discriminative training method. The ELM uses features to capture global, domain, and sentential language phenomena, including named entities, part-of-speech strings, personal usage words, word positions, sentence mood, and sentence tense. We study how different kinds of knowledge elements perform on the task of Chinese Pinyin-to-Chinese-character (PTC) conversion in Internet language (Chinese mobile short messages and Chinese QQ chat records). When different kinds of knowledge elements are combined into the ELM, performance varies, but every ELM enriched with additional knowledge elements outperforms the ELM that uses only the probability knowledge computed by the baseline n-gram model with Kneser-Ney smoothing.
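
For context, a minimal sketch of the whole-sentence exponential form that such models typically take (following Rosenfeld's formulation); the exact feature set and parameterization used in this paper are assumptions here, not quoted from it:

\[
P_\Lambda(s) \;=\; \frac{1}{Z_\Lambda}\, P_0(s)\, \exp\!\Big(\sum_{i} \lambda_i f_i(s)\Big),
\qquad
Z_\Lambda \;=\; \sum_{s'} P_0(s')\, \exp\!\Big(\sum_{i} \lambda_i f_i(s')\Big),
\]

where P_0(s) is the baseline Kneser-Ney n-gram probability of sentence s, each f_i(s) is a knowledge-element feature (for example, an indicator or count for a named entity, part-of-speech string, word position, or sentence mood/tense), and the weights λ_i are tuned discriminatively, here by MSR. For PTC conversion the normalizer Z_Λ cancels when ranking candidate character strings for the same Pinyin input, so only the unnormalized scores need to be compared.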
