Incorporating sparse representation phone identification features in automatic speech recognition using exponential families

Sparse representation phone identification features (SPIF) is a recently developed technique to obtain an estimate of phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates in large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights.

[1]  Hermann Ney,et al.  Acoustic feature combination for robust speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[2]  Brian Kingsbury,et al.  Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Brian Kingsbury,et al.  Boosted MMI for model and feature-space discriminative training , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Sebastian Stüker,et al.  Cross-system adaptation and combination for continuous speech recognition: the influence of phoneme set and acoustic front-end , 2006, INTERSPEECH.

[5]  Xiaodong Cui,et al.  A comparative study on system combination schemes for LVCSR , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[6]  William H. Press,et al.  Numerical recipes in C , 2002 .

[7]  Vaibhava Goel,et al.  Acoustic modeling using exponential families , 2009, INTERSPEECH.

[8]  Jonathan G. Fiscus,et al.  A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[9]  Andreas Stolcke,et al.  THE SRI MARCH 2000 HUB-5 CONVERSATIONAL SPEECH TRANSCRIPTION SYSTEM , 2000 .

[10]  William H. Press,et al.  Numerical Recipes in C, 2nd Edition , 1992 .

[11]  D. Nahamoo,et al.  Sparse Representation Phone Identification Features for Speech Recognition , 2010 .