Bag-of-words input for long history representation in neural network-based language models for speech recognition

In most previous work on neural network-based language models (NNLMs), words are represented as 1-of-N encoded feature vectors. In this paper we investigate an alternative encoding of the word history, the bag-of-words (BOW) representation of a word sequence, and use it as an additional input feature to the NNLM. Both feedforward neural network (FFNN) and long short-term memory recurrent neural network (LSTM-RNN) language models (LMs) with the additional BOW input are evaluated on an English large-vocabulary automatic speech recognition (ASR) task. We show that the BOW features significantly improve both the perplexity (PP) and the word error rate (WER) of a standard FFNN LM. In contrast, the LSTM-RNN LM does not benefit from such an explicit long-context feature, so the performance gap between feedforward and recurrent architectures for language modeling is reduced. In addition, we revisit the cache-based LM, a seeming analog of the BOW approach for count-based LMs, which was unsuccessful for ASR in the past. Although the cache improves the perplexity, we observe only a very small reduction in WER.

Index Terms: language modeling, speech recognition, bag-of-words, feedforward neural networks, recurrent neural networks, long short-term memory, cache language model
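To make the BOW history encoding concrete, the sketch below builds such a feature vector in a minimal form: every word in the preceding history increments its vocabulary dimension, optionally with an exponential decay so that recent words weigh more, and the vector is normalized before being used as a network input. The `decay` parameter, the normalization, and the function name are illustrative assumptions for this example, not the exact recipe evaluated in the paper.

```python
import numpy as np

def bow_history_vector(history_ids, vocab_size, decay=None):
    """Bag-of-words vector over the word history (illustrative sketch).

    history_ids : list of word indices of all preceding words.
    decay       : optional factor in (0, 1]; with decay < 1, more recent
                  words receive exponentially larger weights.
    """
    bow = np.zeros(vocab_size, dtype=np.float32)
    n = len(history_ids)
    for pos, w in enumerate(history_ids):
        weight = 1.0 if decay is None else decay ** (n - 1 - pos)
        bow[w] += weight
    total = bow.sum()
    if total > 0:
        bow /= total  # normalize so the feature sums to one
    return bow

# Example: vocabulary of 10 word types, history "3 1 4 1 5"
print(bow_history_vector([3, 1, 4, 1, 5], vocab_size=10, decay=0.9))
```

One plausible way to plug this in, as the abstract describes, is to feed the BOW vector to the LM alongside the usual 1-of-N encoded n-gram context, e.g. by projecting it and concatenating it with the projected context words at the input layer of the FFNN.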
