论文信息 - A multi-stream ASR framework for BLSTM modeling of conversational speech

A multi-stream ASR framework for BLSTM modeling of conversational speech

We propose a novel multi-stream framework for continuous conversational speech recognition which employs bidirectional Long Short-Term Memory (BLSTM) networks for phoneme prediction. The BLSTM architecture allows recurrent neural nets to model long-range context, which led to improved ASR performance when combined with conventional triphone modeling in a Tandem system. In this paper, we extend the principle of joint BLSTM and triphone modeling to a multi-stream system which uses MFCC features and BLSTM predictions as observations originating from two independent data streams. Using the COSINE database, we show that this technique prevails over a recently proposed single-stream Tandem system as well as over a conventional HMM recognizer.

[1] Jithendra Vepa,et al. An Acoustic Model Based on Kullback-Leibler Divergence for Posterior Features , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[2] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[3] Jürgen Schmidhuber,et al. Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[4] Hynek Hermansky,et al. Multi-resolution RASTA filtering for TANDEM-based ASR , 2005, INTERSPEECH.

[5] Björn W. Schuller,et al. Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening , 2010, IEEE Journal of Selected Topics in Signal Processing.

[6] Björn Schuller,et al. Opensmile: the munich versatile and fast open-source audio feature extractor , 2010, ACM Multimedia.

[7] Yoshua Bengio,et al. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[8] S. C. Kremer,et al. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[9] Jeff A. Bilmes,et al. COSINE - A corpus of multi-party COnversational Speech In Noisy Environments , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Jürgen Schmidhuber,et al. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition , 2005, ICANN.

[11] Jürgen Schmidhuber,et al. An Application of Recurrent Neural Networks to Discriminative Keyword Spotting , 2007, ICANN.

[12] Björn W. Schuller,et al. Spoken term detection with Connectionist Temporal Classification: A novel hybrid CTC-DBN decoder , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13] Björn W. Schuller,et al. Recognition of Noisy Speech: A Comparative Survey of Robust Model Architecture and Feature Enhancement , 2009, EURASIP J. Audio Speech Music. Process..

[14] Björn W. Schuller,et al. Recognition of spontaneous conversational speech using long short-term memory phoneme predictions , 2010, INTERSPEECH.

[15] David Barber,et al. Switching Linear Dynamical Systems for Noise Robust Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[16] Björn W. Schuller,et al. Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17] Björn W. Schuller,et al. Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework , 2010, Cognitive Computation.

[18] Frantisek Grézl,et al. Optimizing bottle-neck features for lvcsr , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.