Improving Language Models by Clustering Training Sentences

Many kinds of language model used in speech understanding suffer from imperfect modeling of intra-sentential contextual influences. I argue that this problem can be addressed by automatically clustering the sentences in a training corpus into subcorpora on the criterion of entropy reduction, and calculating separate language model parameters for each cluster. This kind of clustering offers a way to represent important contextual effects and can therefore significantly improve the performance of a model. It also offers a reasonably automatic means to gather evidence on whether a more complex, context-sensitive model using the same general kind of linguistic information is likely to reward the effort required to develop it: if clustering improves the performance of a model, this demonstrates the existence of further context dependencies not exploited by the unclustered model. As evidence for these claims, I present results showing that clustering improves some models but not others for the ATIS domain. These results are consistent with other findings for such models, suggesting that the presence or absence of an improvement brought about by clustering is indeed a good pointer to whether it is worth developing the unclustered model further.
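The clustering procedure the abstract describes can be illustrated with a minimal sketch: sentences are assigned to the cluster whose language model gives them the lowest cost (highest probability), and per-cluster parameters are re-estimated until assignments stabilize, so that total training-set entropy decreases. This is an assumed simplification, not the paper's exact algorithm; it uses add-one-smoothed unigram models for concreteness, whereas a real system would use the same model class being evaluated, and the helper names (`train_unigram`, `cluster_sentences`) are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences, vocab):
    """Add-one-smoothed unigram probabilities for one cluster's subcorpus."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def neg_log_prob(sentence, model):
    """Cost of a sentence under a cluster's model (lower is better)."""
    return -sum(math.log(model[w]) for w in sentence)

def cluster_sentences(sentences, k=2, iters=10):
    """Iteratively reassign sentences to the cluster that models them best,
    re-estimating per-cluster parameters each pass (entropy-reduction criterion)."""
    vocab = {w for s in sentences for w in s}
    assign = [i % k for i in range(len(sentences))]  # round-robin start
    for _ in range(iters):
        models = [train_unigram([s for s, a in zip(sentences, assign) if a == c],
                                vocab)
                  for c in range(k)]
        new_assign = [min(range(k), key=lambda c: neg_log_prob(s, models[c]))
                      for s in sentences]
        if new_assign == assign:  # converged
            break
        assign = new_assign
    return assign
```

On a toy corpus with two topics (e.g. flight queries vs. something else), the reassignment loop separates the sentences into topic-coherent subcorpora within a few iterations; each subcorpus then gets its own, sharper parameter estimates.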
