The estimation of powerful language models from small and large corpora

The authors consider the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. They begin with improved modeling of the grammar statistics, based on a combination of the backing-off technique and zero-frequency techniques. These are extended to be more amenable to the particular system considered here. The resulting technique is greatly simplified, more robust, and gives improved recognition performance over either of the previous techniques. The authors also consider the problem of robustness of a model based on a small training corpus by grouping words into obvious semantic classes. This significantly improves the robustness of the resulting statistical grammar. A technique that allows the estimation of a high-order model on modest computation resources is also presented. This makes it possible to run a 4-gram statistical model of a 50-million word corpus on a workstation of only modest capability and cost. Finally, the authors discuss results from applying a 2-gram statistical language model integrated in the HMM (hidden Markov model) search, obtaining a list of the N-best recognition results, and rescoring this list with a higher-order statistical model.<<ETX>>

[1]  Ian H. Witten,et al.  The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[2]  Richard M. Schwartz,et al.  A Simple Statistical Class Grammar for Measuring Speech Recognition Performance , 1989, HLT.

[3]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[4]  Volker Steinbiss,et al.  Cooccurrence smoothing for stochastic language modeling , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Francis Kubala,et al.  New uses for the N-Best sentence hypotheses within the BYBLOS speech recognition system , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Richard M. Schwartz,et al.  Toward a Real-Time Spoken Language System Using Commercial Hardware , 1990, HLT.

[7]  Masafumi Nishimura,et al.  Isolated word recognition using hidden Markov models , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  Madeleine Bates,et al.  A Proposal for SLS Evaluation , 1989, HLT.

[9]  Ian H. Witten,et al.  Text Compression , 1990, 125 Problems in Text Algorithms.