Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework

Robustly detecting keywords in human speech is an important precondition for cognitive systems that aim to interact intelligently with users. Conventional keyword spotting techniques usually perform well when evaluated on well-articulated read speech. However, modeling natural, spontaneous, and emotionally colored speech is challenging for today’s speech recognition systems and thus requires novel approaches with enhanced robustness. In this article, we propose a new architecture for vocabulary-independent keyword detection as needed for cognitive virtual agents such as the SEMAINE system. Our word spotting model is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. The BLSTM network uses a self-learned amount of contextual information to provide a discrete phoneme prediction feature for the DBN, which is able to distinguish between keywords and arbitrary speech. We evaluate our Tandem BLSTM-DBN technique on both read speech and spontaneous emotional speech and show that our method significantly outperforms conventional Hidden Markov Model-based approaches in both application scenarios.
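As a rough illustration of the "discrete phoneme prediction feature" mentioned above, the following sketch turns per-frame phoneme scores into one discrete phoneme index per frame. It assumes that the discrete feature is the index of the highest-scoring phoneme in each frame; the abstract does not state this. The function name `discrete_phoneme_feature` and the toy scores are hypothetical, and no BLSTM or DBN is implemented here.

```python
from typing import List, Sequence


def discrete_phoneme_feature(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame's phoneme scores to the index of the best-scoring phoneme.

    Assumption (not stated in the abstract): the discrete feature passed to
    the DBN is the arg-max over the network's per-frame output scores.
    """
    predictions = []
    for scores in frame_scores:
        if not scores:
            raise ValueError("each frame needs at least one phoneme score")
        # Index of the largest score; ties resolve to the lowest index.
        best = max(range(len(scores)), key=lambda i: scores[i])
        predictions.append(best)
    return predictions


# Hypothetical scores for 4 frames over 3 phoneme classes (stand-ins for
# network outputs; real scores would come from a trained model).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]

print(discrete_phoneme_feature(toy_scores))  # prints [0, 0, 1, 2]
```

In a tandem setup of this kind, the resulting index sequence would serve as an additional observed variable for the graphical model alongside the acoustic features; how the DBN weights it is outside the scope of this sketch.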
