Investigations on exemplar-based features for speech recognition towards thousands of hours of unsupervised, noisy data

The acoustic models in state-of-the-art speech recognition systems are based on phones in context, represented by hidden Markov models. This modeling approach may be limited in that it is hard to incorporate long-span acoustic context. Exemplar-based approaches are an attractive alternative, in particular if massive data and computational power are available. Yet, most of the data at Google are unsupervised and noisy. This paper investigates an exemplar-based approach under this not yet well understood data regime. A log-linear rescoring framework is used to combine the exemplar-based features at the word level with the first-pass model. This approach guarantees at least baseline performance and focuses on the refined modeling of words with sufficient data. Experimental results for the Voice Search and YouTube tasks are presented.
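As a rough illustration of such a log-linear rescoring scheme (a minimal sketch; the exact features, notation, and training criterion are assumptions and are not given in the abstract), the score of a word sequence W given acoustics X might be written as

\[
\mathrm{score}(W \mid X) \;=\; \lambda_0 \,\log p_{\mathrm{base}}(W \mid X) \;+\; \sum_{i} \lambda_i \, f_i(W, X),
\]

where p_base denotes the first-pass model score, the f_i are word-level exemplar-based features, and the \lambda are log-linear weights. With \lambda_0 = 1 and all other \lambda_i = 0 the model reduces to the first-pass system, which is one way a rescoring framework of this kind can guarantee at least baseline performance while refining only words with sufficient data.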
