Enhancing Exemplar-Based Posteriors for Speech Recognition Tasks

Posteriors generated by exemplar-based sparse representation (SR) methods are typically learned to minimize the reconstruction error of the feature vectors. They are therefore not learned through a discriminative process linked to the word error rate (WER) objective of a speech recognition task. In this paper, we explore modeling exemplar-based posteriors to address this issue. First, we train a neural network (NN) that takes exemplar-based posteriors as inputs. This produces a new set of posteriors that have been learned to minimize a cross-entropy measure and, indirectly, frame error rate. Second, we apply a tied-mixture smoothing technique to the new NN posteriors, making them better suited to a speech recognition task. On the TIMIT task, modeling SR posteriors with a NN improves the performance of our sparse representations by 1.3% absolute, achieving a phone error rate (PER) of 19.0%. Applying further smoothing techniques to these NN posteriors improves the PER to 18.7%, one of the best results reported in the literature on TIMIT.
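The first step described above, re-estimating posteriors with a network trained on a cross-entropy criterion, can be illustrated with a minimal sketch. This is not the paper's architecture or data: it assumes a single softmax layer in place of a full NN, synthetic per-frame posteriors in place of real exemplar-based ones, and hypothetical names (`train_posterior_remapper`, `sr_post`, `nn_post`) chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-probability of the correct class per frame.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def train_posterior_remapper(sr_post, labels, n_classes, lr=0.5, epochs=200, seed=0):
    """Fit a single softmax layer (an assumed stand-in for the NN) that maps
    input posteriors to new posteriors by minimizing cross-entropy."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(sr_post.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    losses = []
    for _ in range(epochs):
        probs = softmax(sr_post @ W + b)
        losses.append(cross_entropy(probs, labels))
        # Gradient of the mean cross-entropy with respect to the logits.
        grad = (probs - onehot) / len(labels)
        W -= lr * (sr_post.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b, losses

# Synthetic stand-in for exemplar-based posteriors (assumption, not real data):
# a noisy distribution over classes, biased toward the true class of each frame.
rng = np.random.default_rng(1)
n_frames, n_classes = 600, 5
labels = rng.integers(0, n_classes, size=n_frames)
noise = rng.dirichlet(np.ones(n_classes), size=n_frames)
sr_post = 0.6 * noise + 0.4 * np.eye(n_classes)[labels]

W, b, losses = train_posterior_remapper(sr_post, labels, n_classes)
nn_post = softmax(sr_post @ W + b)  # re-estimated posteriors, one row per frame
```

Each row of `nn_post` is again a probability distribution over classes, but one fitted to the frame labels rather than to reconstruction error. The tied-mixture smoothing step is not sketched here.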
