Automatic Acquisition of Phrase Grammars for Stochastic Language Modeling

Phrase-based language models have been recognized to have an advantage over word-based language models since they allow us to capture long-spanning dependencies. Class-based language models have been used to improve model generalization and overcome problems with data sparseness. In this paper, we present a novel approach for combining the phrase acquisition with the class construction process to automatically acquire phrase-grammar fragments from a given corpus. The phrase-grammar learning is decomposed into two sub-problems, namely phrase acquisition and feature selection. The phrase acquisition is based on entropy minimization and the feature selection is driven by the entropy reduction principle. We further demonstrate that the phrase-grammar based n-gram language model significantly outperforms a phrase-based n-gram language model in an end-to-end evaluation of a spoken language application.

1 Introduction

Traditionally, n-gram language models implicitly assume words as the basic lexical unit. However, certain word sequences (phrases) are recurrent in constrained domain languages and can be thought of as a single lexical entry (e.g. "by and large", "I would like to", "United States of America", etc.). A traditional word n-gram based language model can benefit greatly from using variable-length units to capture long-spanning dependencies, for any given order n of the model. Furthermore, language modeling based on longer-length units is applicable to languages which do not have a predefined notion of a word. However, the problem of data sparseness is more acute in phrase-based language models than in word-based language models. Clustering words into classes has been used to overcome data sparseness in word-based language models (Brown et al., 1992; Kneser and Ney, 1993; Pereira et al., 1993; McCandless and Glass, 1993; Bellegarda et al., 1996; Saul and Pereira, 1997). Although the automatically acquired phrases can be clustered into classes afterwards to overcome data sparseness, we present a novel approach of combining the construction of classes with the acquisition of phrases. This integration of phrase acquisition and class construction results in the acquisition of phrase-grammar fragments. In (Gorin, 1996; Arai et al., 1997), grammar fragment acquisition is performed through Kullback-Leibler divergence techniques with application to topic classification from text.

Although phrase-grammar fragments reduce the problem of data sparseness, they can result in over-generalization. For example, one of the classes induced in our experiments was C1 = {and, but, because}, which one might call the class of conjunctions. However, this class was part of a phrase-grammar fragment such as "A T C1 T", which generates the phrases "A T and T", "A T but T", and "A T because T", a clear case of over-generalization given our corpus. Hence we need to further stochastically separate the phrases generated by a phrase-grammar fragment.
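To make the over-generalization problem concrete, the short Python sketch below expands a phrase-grammar fragment containing a word class into all the surface phrases it can generate and counts how often each one is attested in a toy corpus. The class C1 = {and, but, because} and the fragment "A T C1 T" come from the example above; the helper expand_fragment and the toy corpus are purely illustrative assumptions, not part of the paper's method.

```python
from itertools import product

# Illustrative sketch of the over-generalization problem described above.
# The class C1 = {and, but, because} and the fragment "A T C1 T" come from
# the paper's example; the toy corpus and helper names are assumptions.

classes = {"C1": ["and", "but", "because"]}

def expand_fragment(fragment, classes):
    """Expand every class symbol in a fragment into its member words,
    yielding all surface phrases the fragment can generate."""
    slots = [classes.get(sym, [sym]) for sym in fragment.split()]
    for combo in product(*slots):
        yield " ".join(combo)

# Toy corpus (invented): only two of the three expansions are attested.
toy_corpus = ["A T and T", "A T and T", "A T but T"]

for phrase in expand_fragment("A T C1 T", classes):
    print(f"{phrase!r}: observed {toy_corpus.count(phrase)} time(s)")
# Expansions that occur rarely or never in the corpus indicate that the
# fragment over-generates; such phrases need to be stochastically
# separated from the phrases the fragment should actually produce.
```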
In this paper, we present our approach to integrating phrase acquisition and clustering and our technique to specialize the acquired phrase fragments. We extensively evaluate the effectiveness of the phrase-grammar based n-gram language model and demonstrate that it outperforms a phrase-based n-gram language model in an end-to-end evaluation of a spoken language application.

The outline of the paper is as follows. In Section 2, we review the phrase acquisition algorithm presented in (Riccardi et al., 1997). In Section 3, we discuss our approaches to phrase acquisition and clustering, respectively. The algorithm integrating the phrase acquisition and clustering processes is presented in Section 4. The spoken language application for automatic call routing (How May I Help You? (HMIHY)) that is used for evaluating our approach and the results of our experiments are described in Section 5.

2 Learning Phrases

In previous work, we have shown the effectiveness of incorporating manually selected phrases for reducing the test set perplexity¹ and the word error rate of a large vocabulary recognizer (Riccardi et al., 1995; Riccardi et al., 1996). However, a critical issue for the design of a language model based on phrases is the algorithm that automatically chooses the units by optimizing a suitable cost function. For improving the prediction of word probabilities, the criterion we used is the minimization of the language perplexity PP(T) on a training corpus T. This algorithm for extracting phrases from a training corpus is similar in spirit to (Giachin, 1995), but differs in the language model components and optimization parameters (Riccardi et al., 1997). In addition, we extensively evaluate the effectiveness of phrase n-gram (n > 2) language models by means of an end-to-end evaluation of a spoken language system (see Section 5).

The phrase acquisition method is a greedy algorithm that performs local optimization based on an iterative process which converges to a local minimum of PP(T). As depicted in Figure 1, the algorithm consists of three main parts:

• Generation and ranking of a set of candidate phrases. This step is repeated at each iteration to constrain the search over all possible symbol sequences observed in the training corpus.
• Each candidate phrase is evaluated in terms of the training set perplexity.
• At the end of the iteration, the set of selected phrases is used to filter the training corpus and replace each occurrence of a phrase with a new lexical unit. The filtered training corpus will be referred to as T_f.

In the first step of the procedure, a set of candidate phrases (unit pairs)² is drawn out of the training corpus T and ranked according to a correlation coefficient. The most commonly used measure for the interdependence of two events is the mutual information MI(x, y) = log [P(x, y) / (P(x) P(y))]. However, in this experiment, we use a correlation coefficient that has provided the best convergence speed for the optimization procedure:

    ρ_{x,y} = P(x, y) / (P(x) + P(y))                (1)

where P(x) is the probability of symbol x. The coefficient ρ_{x,y} (0 ≤ ρ_{x,y} ≤ 0.5) is easily extended to ρ_{x1,x2,...,xn} for the n-tuple (x1, x2, ..., xn) (0 ≤ ρ_{x1,...,xn} ≤ 1/n). Phrases (x, y) with high ρ_{x,y} or MI(x, y) are such that P(x, y) ≈ P(x) ≈ P(y). In the case of P(x, y) = P(x) = P(y), ρ_{x,y} = 0.5 while MI = −log P(x). Namely, the ranking by MI is biased towards low-probability events, which are not likely to be selected by our Maximum Likelihood algorithm. In fact, the phrase (x, y)

[Figure 1: The phrase acquisition algorithm: generation and ranking of candidate phrases, phrase selection by perplexity minimization, and training set filtering.]

¹ The perplexity PP(T) of a corpus T is PP(T) = exp(−(1/n) log P(T)), where n is the number of words in T.
² We ranked symbol pairs and increased the phrase length by successive iterations. An additional speed-up to the algorithm could be gained by ranking symbol k-tuples (k > 2) at each iteration.
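As a rough illustration of a single iteration of this procedure, the sketch below ranks adjacent symbol pairs by the correlation coefficient of Eq. (1), filters the corpus by merging the top-ranked pair into a single lexical unit, and reports perplexity before and after. This is a minimal sketch under simplifying assumptions: the function names (pair_correlations, filter_corpus, unigram_perplexity), the three example sentences, and the use of a maximum-likelihood unigram model as a stand-in for the paper's full n-gram language model are all illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter

# Minimal sketch (not the paper's implementation): one iteration of
# candidate-phrase ranking and corpus filtering. All probabilities are
# normalized by the same token count so that 0 <= rho <= 0.5 holds.

def pair_correlations(sentences):
    """Score adjacent symbol pairs by rho(x, y) = P(x, y) / (P(x) + P(y))."""
    unigrams, bigrams = Counter(), Counter()
    n_tokens = 0
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        n_tokens += len(words)
    rho = {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_tokens
        p_x, p_y = unigrams[x] / n_tokens, unigrams[y] / n_tokens
        rho[(x, y)] = p_xy / (p_x + p_y)
    return rho

def filter_corpus(sentences, pair, sep="_"):
    """Replace each occurrence of the selected pair with a new lexical unit."""
    x, y = pair
    filtered = []
    for words in sentences:
        merged, i = [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) == (x, y):
                merged.append(x + sep + y)
                i += 2
            else:
                merged.append(words[i])
                i += 1
        filtered.append(merged)
    return filtered

def unigram_perplexity(sentences):
    """PP(T) = exp(-(1/n) log P(T)), with a maximum-likelihood unigram model
    standing in for the paper's n-gram model (illustration only)."""
    counts = Counter(w for s in sentences for w in s)
    n = sum(counts.values())
    log_p = sum(c * math.log(c / n) for c in counts.values())
    return math.exp(-log_p / n)

# Toy corpus (invented for illustration).
corpus = [s.split() for s in [
    "i would like to check my bill",
    "i would like to make a call",
    "how may i help you",
]]

ranked = sorted(pair_correlations(corpus).items(), key=lambda kv: -kv[1])
best_pair, best_rho = ranked[0]
filtered = filter_corpus(corpus, best_pair)
print(best_pair, round(best_rho, 3))
print(unigram_perplexity(corpus), unigram_perplexity(filtered))
```

In a full implementation the ranking, perplexity evaluation, and filtering steps would be repeated, growing longer phrases across iterations as described above.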

[1] Fernando Pereira, et al. Aggregate and mixed-order Markov models for statistical language processing, 1997, EMNLP.

[2] Roberto Pieraccini, et al. Non-deterministic stochastic language models for speech recognition, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[3] H. Schütze, et al. Dimensions of meaning, 1992, Supercomputing '92.

[4] Jerome R. Bellegarda, et al. A novel word clustering algorithm based on latent semantic analysis, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[5] Allen L. Gorin, et al. Generating semantically consistent inputs to a dialog manager, 1997, EUROSPEECH.

[6] Giuseppe Riccardi, et al. Grammar Fragment acquisition using syntactic and semantic clustering, 1998, Speech Commun.

[7] Allen L. Gorin, et al. Processing of semantic information in fluently spoken language, 1996, Proceedings of Fourth International Conference on Spoken Language Processing, ICSLP '96.

[8] Naftali Tishby, et al. Distributional Clustering of English Words, 1993, ACL.

[9] Hermann Ney, et al. Improved clustering techniques for class-based statistical language modelling, 1993, EUROSPEECH.

[10] Anil K. Jain, et al. Algorithms for Clustering Data, 1988.

[11] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[12] James R. Glass, et al. Empirical acquisition of word and phrase classes in the ATIS domain, 1993, EUROSPEECH.

[13] Roberto Pieraccini, et al. Stochastic automata for language modeling, 1996, Comput. Speech Lang.

[14] Giuseppe Riccardi, et al. How may I help you?, 1997, Speech Commun.