Tandem decoding of children's speech for keyword detection in a child-robot interaction scenario

In this article, we address keyword detection in children's speech, as required by voice command systems. We use the FAU Aibo Emotion Corpus, which contains emotionally colored, spontaneous children's speech recorded in a child-robot interaction scenario, and investigate several recent keyword spotting techniques. Because bidirectional Long Short-Term Memory (BLSTM) networks are known to be well suited to context-sensitive phoneme prediction, we incorporate a BLSTM network into a Tandem model for flexible coarticulation modeling in children's speech. Our experiments show that the Tandem model outperforms a triphone-based Hidden Markov Model approach.
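To make the Tandem idea concrete, the following minimal sketch shows one common way of combining framewise phoneme posteriors with conventional acoustic features. It is an illustration under stated assumptions, not the configuration used in this article: the function names `tandem_features` and `framewise_phoneme_prediction`, the use of log posteriors, and the simple concatenation are all hypothetical choices, and the posteriors are assumed to come from some already trained phoneme predictor such as a BLSTM network.

```python
import numpy as np


def tandem_features(acoustic, posteriors, eps=1e-8):
    """Append log phoneme posteriors to per-frame acoustic features.

    acoustic:   (T, D) array of acoustic features (assumed, e.g. cepstral).
    posteriors: (T, P) array of framewise phoneme posteriors, assumed to be
                produced by a separate network; each row should sum to 1.
    Returns a (T, D + P) array of combined features.
    """
    acoustic = np.asarray(acoustic, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    if acoustic.shape[0] != posteriors.shape[0]:
        raise ValueError("both streams must have the same number of frames")
    # eps avoids log(0) for phonemes the network rules out completely.
    log_post = np.log(posteriors + eps)
    return np.concatenate([acoustic, log_post], axis=1)


def framewise_phoneme_prediction(posteriors):
    """Return the index of the most probable phoneme for every frame."""
    return np.argmax(np.asarray(posteriors, dtype=float), axis=1)


# Toy example: 3 frames, 2 acoustic dimensions, 3 phoneme classes.
acoustic = np.zeros((3, 2))
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
features = tandem_features(acoustic, posteriors)
predictions = framewise_phoneme_prediction(posteriors)
```

The combined features (or, alternatively, the discrete per-frame predictions) would then be passed to a conventional decoder as an additional observation stream; how the two streams are weighted or decorrelated is a design choice this sketch leaves open.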
