Deep bottleneck features and sound-dependent i-vectors for simultaneous recognition of speech and environmental sounds

In speech interfaces, it is often necessary to understand the overall auditory environment: not only recognizing what is being said, but also being aware of the location or actions surrounding the utterance. However, automatic speech recognition (ASR) becomes difficult when speech is accompanied by environmental sounds. Standard solutions treat environmental sounds as noise and remove them to improve ASR performance. Conversely, most studies on environmental sounds construct classifiers for environmental sounds alone, without interference from spoken utterances. In reality, however, such isolated conditions rarely occur. This study addresses the problem of simultaneous recognition of speech and environmental sounds. In particular, we examine the possibility of using deep neural network (DNN) techniques to recognize speech and environmental sounds simultaneously and to improve the accuracy of both tasks under their respective noisy conditions. First, we investigate DNN architectures, including two parallel single-task DNNs and a single multi-task DNN. However, we found direct multi-task learning of simultaneous speech and environmental sound recognition to be difficult. We therefore propose a method that combines bottleneck features and sound-dependent i-vectors within this framework. Experimental results reveal that using bottleneck features and i-vectors as DNN inputs helps to improve the accuracy of each recognition task.
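
The paper provides no code, but the multi-task setup described above can be illustrated concretely. Below is a minimal PyTorch sketch, not the authors' exact system: all dimensions (bottleneck feature size, i-vector size, number of senones and environmental-sound classes) and the loss weighting are hypothetical placeholders. It shows a multi-task DNN that shares hidden layers between a speech (senone) head and an environmental-sound head, with frame-level bottleneck features concatenated with an utterance-level i-vector as input.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticDNN(nn.Module):
    """Shared hidden layers feeding two task-specific heads:
    one for speech (senone) targets, one for environmental-sound classes.
    The input is a bottleneck-feature frame concatenated with a
    sound-dependent i-vector. All dimensions here are illustrative."""

    def __init__(self, bn_dim=42, ivec_dim=100, hidden_dim=1024,
                 num_senones=3000, num_env_classes=10):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(bn_dim + ivec_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        self.speech_head = nn.Linear(hidden_dim, num_senones)
        self.env_head = nn.Linear(hidden_dim, num_env_classes)

    def forward(self, bn_feats, ivec):
        # The i-vector is fixed per utterance, so it is broadcast
        # across all frames before concatenation with the features.
        x = torch.cat([bn_feats, ivec.expand(bn_feats.size(0), -1)], dim=1)
        h = self.shared(x)
        return self.speech_head(h), self.env_head(h)

# Joint training minimizes a weighted sum of the two cross-entropies
# (the 0.5 task weight is an arbitrary choice for this sketch).
model = MultiTaskAcousticDNN()
frames = torch.randn(8, 42)   # 8 frames of bottleneck features
ivec = torch.randn(1, 100)    # one utterance-level i-vector
speech_logits, env_logits = model(frames, ivec)
loss = (nn.functional.cross_entropy(speech_logits,
                                    torch.randint(3000, (8,)))
        + 0.5 * nn.functional.cross_entropy(env_logits,
                                            torch.randint(10, (8,))))
loss.backward()
```

Balancing the two cross-entropy terms is one common way to train such shared-layer networks jointly; the paper's observation that direct multi-task learning is difficult suggests the input representation (bottleneck features plus i-vectors), rather than the loss, is the critical design choice here.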
