Phoneme-specific speech separation

Speech separation and enhancement algorithms seldom exploit information about phoneme identity. In this study, we propose a novel phoneme-specific speech separation method. Rather than training a single global model to enhance all frames, we train a separate model for each phoneme to process its corresponding frames. A robust ASR system determines the phoneme identity of each frame, so information from the ASR system and language model directly influences speech separation through the selection of a phoneme-specific model at test time. In addition, phoneme-specific models have fewer variations to capture and avoid the data imbalance problem. The improved enhancement results can in turn help recognition. Experiments on the corpus of the second CHiME speech separation and recognition challenge (track 2) demonstrate the effectiveness of this method in terms of objective measures of speech intelligibility and quality, as well as recognition performance.
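To make the test-time routing concrete, below is a minimal sketch, not the authors' implementation, of per-phoneme model selection: a frame-level phoneme alignment from an ASR decoder picks which enhancement model processes each frame. The phoneme inventory, the linear mask estimators standing in for separately trained DNNs, and the `enhance` function are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of phoneme-specific separation at test time.
# Assumes a frame-level phoneme alignment from an ASR decoder and one
# mask-estimation model per phoneme; names and shapes are illustrative.

PHONEMES = ["sil", "aa", "iy", "s"]  # toy phoneme inventory
N_FEATS = 64                         # feature dimension per frame

rng = np.random.default_rng(0)

# Stand-in per-phoneme models: each "model" here is just a linear map
# feeding a sigmoid to estimate a ratio mask; in the paper's setting
# these would be separately trained DNNs.
models = {p: rng.standard_normal((N_FEATS, N_FEATS)) * 0.01 for p in PHONEMES}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def enhance(noisy_feats, alignment):
    """Route each frame to the model of its ASR-decoded phoneme.

    noisy_feats: (T, N_FEATS) noisy spectral features
    alignment:   length-T list of phoneme labels from the ASR system
    """
    masks = np.empty_like(noisy_feats)
    for t, phone in enumerate(alignment):
        w = models[phone]                       # select phoneme-specific model
        masks[t] = sigmoid(noisy_feats[t] @ w)  # estimate a ratio mask
    return masks * noisy_feats                  # apply mask to enhance

# Toy usage: 5 frames with a (hypothetical) decoded alignment.
feats = np.abs(rng.standard_normal((5, N_FEATS)))
align = ["sil", "aa", "aa", "s", "iy"]
enhanced = enhance(feats, align)
print(enhanced.shape)  # (5, 64)
```

Because each model only ever sees frames of one phoneme, the routing step is where recognition-side knowledge enters the separation pipeline; everything downstream of the alignment is a standard masking-based enhancer.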
