Auditory features for close-talk speech enhancement with parameter masks

Speech segregation and enhancement is a challenging task in speech communication. To obtain clean target speech, a close-talk system collects the speech with a nearby microphone. A deep neural network (DNN) estimator computes the speech energy in each frequency channel with parameter masks. Adjusted binaural auditory features serve as the main input for DNN-based speech energy estimation: the energy difference between the two microphones is the primary binaural feature, and the time difference between them is used as a comparison feature. Experiments show that the energy-difference feature achieves performance similar to a combination of monaural and binaural auditory features from both microphones, at limited computational cost. The two-microphone energy difference is therefore one of the key features for close-talk speech enhancement.
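The core feature described above, the per-frequency-channel energy difference between the close microphone and the far microphone, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple framed FFT with linearly split frequency bands as a stand-in for the gammatone filterbank typically used in auditory front ends, and the function names and parameters are hypothetical.

```python
import numpy as np

def band_log_energy(x, frame_len=256, hop=128, n_bands=32):
    """Frame a signal and compute log energy in frequency bands.
    (Linearly split FFT bands are used here as a simple stand-in
    for a gammatone filterbank.)"""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    energies = np.empty((n_frames, n_bands))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        spec = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum
        bands = np.array_split(spec, n_bands)       # group bins into bands
        energies[t] = [b.sum() for b in bands]
    return np.log(energies + 1e-10)                 # floor avoids log(0)

def energy_difference_feature(close_mic, far_mic, **kw):
    """Per-channel log-energy difference between the two microphones,
    the binaural feature the abstract feeds to the DNN estimator."""
    return band_log_energy(close_mic, **kw) - band_log_energy(far_mic, **kw)
```

In a close-talk setup the target speaker is much louder at the near microphone, so time-frequency units dominated by target speech show a large positive difference, which is what makes this single feature discriminative for mask estimation.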
