Analyzing and Improving Statistical Language Models for Speech Recognition

A speech recognizer is a device that translates speech into text. Many current speech recognizers contain two components: an acoustic model and a statistical language model. The acoustic model indicates how likely it is that a certain word corresponds to a part of the acoustic signal (i.e. the speech). The statistical language model indicates how likely it is that a certain word will be spoken next, given the words recognized so far. For example, even though the acoustic model might not be able to decide between the acoustically similar words "peach" and "teach", the statistical language model can indicate that "peach" is more likely if the previously recognized words are "He ate the". Current speech recognizers perform well on constrained tasks, but the goal of continuous, speaker-independent speech recognition in potentially noisy environments with a very large vocabulary has not yet been reached. How can statistical language models be improved so that more complex tasks can be tackled? This is the question addressed in this thesis. Since knowing the weaknesses of a theory often makes it easier to improve that theory, the central idea of this thesis is to analyze the weaknesses of existing statistical language models in order to subsequently improve them. To that end, we formally define a weakness of a statistical language model in terms of the logarithm of the total probability (LTP), a term closely related to the standard perplexity measure used to evaluate statistical language models. This definition is applicable to many probabilistic models, including almost all currently used statistical language models. We apply our definition of a weakness to a frequently used statistical language model, called a bi-pos model. This results, for example, in a new modeling of unknown words which improves the performance of the model by 14% to 21%.
Moreover, one of the identified weaknesses has prompted the development of our generalized N-pos language model, which is also outlined in this thesis. It can incorporate linguistic knowledge even if that knowledge extends over many words, which is not feasible in a traditional N-pos model. This leads to a discussion of what knowledge should be added to statistical language models in general, and we give criteria for selecting potentially useful knowledge. These results show the usefulness both of our definition of a weakness and of analyzing the weaknesses of statistical language models in general.
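
The relationship between the logarithm of the total probability and perplexity can be illustrated with a small sketch. It is not the model from this thesis: it uses a plain word bigram model with add-one smoothing as a stand-in for the bi-pos model, and it assumes that LTP is the sum of the log conditional probabilities of the words in a text and that perplexity is exp(-LTP / N) for N predicted words. The toy sentences and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences, vocab):
    """Count word bigrams and return an add-one-smoothed P(word | prev)."""
    bigrams = Counter()
    contexts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent  # "<s>" marks the sentence start
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    v = len(vocab)

    def prob(word, prev):
        # Add-one smoothing: every word in vocab gets non-zero probability.
        return (bigrams[(prev, word)] + 1) / (contexts[prev] + v)

    return prob


def log_total_probability(prob, sentences):
    """Return (LTP, N): summed log probabilities and number of predicted words."""
    ltp = 0.0
    n = 0
    for sent in sentences:
        tokens = ["<s>"] + sent
        for prev, word in zip(tokens, tokens[1:]):
            ltp += math.log(prob(word, prev))
            n += 1
    return ltp, n


def perplexity(ltp, n):
    """Standard perplexity computed from the log total probability."""
    return math.exp(-ltp / n)


# Hypothetical toy training data, chosen to mirror the "peach"/"teach" example.
train = [
    ["he", "ate", "the", "peach"],
    ["she", "ate", "the", "peach"],
    ["they", "teach", "the", "class"],
]
vocab = {word for sent in train for word in sent}
prob = train_bigram(train, vocab)

test_text = [["he", "ate", "the", "peach"]]
ltp, n = log_total_probability(prob, test_text)
print("P(peach | the) =", prob("peach", "the"))
print("P(teach | the) =", prob("teach", "the"))
print("LTP =", ltp, "perplexity =", perplexity(ltp, n))
```

Because perplexity is a monotone function of LTP for a fixed text length, a change to the model that raises the LTP of a test text lowers its perplexity by construction, which is why a weakness stated in terms of LTP maps directly onto the usual evaluation measure.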
