Interpolation of n-gram and mutual-information based trigger pair language models for Mandarin speech recognition

While n-gram modeling is simple and dominant in speech recognition, it captures only the short-distance context dependency within an n-word window, where the largest practical n for natural language is currently three. However, many context dependencies in natural language occur beyond a three-word window. This paper proposes a new language modeling approach that captures the preferred relationships between words over both short and long distances through the concept of MI-Trigger pairs. Different MI-Trigger-based models are constructed, in either a distance-dependent or a distance-independent way, within windows of 1 to 10 words. The new MI-Trigger-based modeling is also compared and merged with word bigram modeling. It is found that MI-Trigger-based modeling outperforms word bigram modeling, and that the n-gram and MI-Trigger models complement each other well: their proper merging further increases the recognition rate when tested on Mandarin speech recognition. One advantage of MI-Trigger-based modeling is that it requires far fewer parameters than word bigram modeling. Another is that the number of trigger pairs in an MI-Trigger model can be kept to a reasonable size without losing much of its modeling power. © 1999 Academic Press
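The abstract does not spell out the selection or combination formulas, so the Python sketch below illustrates one plausible reading: candidate trigger pairs are ranked by a mutual-information score over a 1-to-10-word window, and the resulting trigger model is linearly interpolated with a word bigram model. The pointwise-MI scoring, the trigger-probability normalization, and the interpolation weight `lam` are assumptions made for illustration, not the paper's exact method (Rosenfeld-style average MI and back-off smoothing are common alternatives).

```python
import math
from collections import Counter


def collect_pairs(tokens, max_dist=10):
    """Count ordered (trigger, target) pairs co-occurring within a
    window of 1..max_dist words (distance-independent counting)."""
    pairs = Counter()
    for i in range(len(tokens)):
        for b in tokens[i + 1 : i + 1 + max_dist]:
            pairs[(tokens[i], b)] += 1
    return pairs


def select_mi_triggers(tokens, max_dist=10, top_k=10000):
    """Score candidate pairs by a simple joint-probability-weighted
    pointwise MI and keep the top_k pairs as the MI-Trigger set.
    (Assumed criterion; the paper may use average MI instead.)"""
    unigrams = Counter(tokens)
    pairs = collect_pairs(tokens, max_dist)
    n = len(tokens)
    total = sum(pairs.values())
    scored = {
        (a, b): (c / total)
        * math.log((c / total) / ((unigrams[a] / n) * (unigrams[b] / n)))
        for (a, b), c in pairs.items()
    }
    return dict(sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k])


def trigger_prob(w, history, triggers, vocab, max_dist=10):
    """P_trigger(w | history): trigger scores of w fired by the last
    max_dist words, renormalized over the vocabulary; falls back to
    uniform when no trigger fires. (Assumed normalization scheme.)"""
    window = history[-max_dist:]

    def score(v):
        return sum(triggers.get((a, v), 0.0) for a in window)

    z = sum(score(v) for v in vocab)
    return score(w) / z if z > 0 else 1.0 / len(vocab)


def interpolated_prob(w, history, bigram, triggers, vocab, lam=0.7):
    """Linear interpolation of the two models:
    P(w|h) = lam * P_bigram(w | h[-1]) + (1 - lam) * P_trigger(w | h).
    `bigram` is a dict of conditional probabilities keyed by (prev, w);
    the uniform fallback stands in for proper back-off smoothing."""
    p_bi = bigram.get((history[-1], w), 1.0 / len(vocab))
    return lam * p_bi + (1.0 - lam) * trigger_prob(w, history, triggers, vocab)


if __name__ == "__main__":
    # Toy demo on a tiny token sequence, just to exercise the pipeline.
    tokens = ("the model predicts the next word from the history and "
              "the trigger pairs extend the history window").split()
    vocab = sorted(set(tokens))
    triggers = select_mi_triggers(tokens, max_dist=10, top_k=50)
    counts = Counter(zip(tokens, tokens[1:]))
    ctx = Counter(tokens[:-1])
    bigram = {(a, b): c / ctx[a] for (a, b), c in counts.items()}
    print(interpolated_prob("word", tokens[:5], bigram, triggers, vocab))
```

A distance-dependent variant would simply keep a separate pair count (and MI score) per distance d = 1..10 instead of pooling all distances into one counter.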
