Robust speech recognition using temporal masking and thresholding algorithm

In this paper, we present a new dereverberation algorithm called Temporal Masking and Thresholding (TMT), which enhances the temporal spectra of spectral features for robust speech recognition in reverberant environments. The algorithm is motivated by the precedence effect and temporal masking in human auditory perception. This work improves on our previous dereverberation algorithm, Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF). The TMT algorithm characterizes temporal masking and thresholding with a different mathematical model than the one used in SSF: the nonlinear high-pass filtering of the SSF algorithm is replaced by a masking mechanism based on a combination of peak detection and dynamic thresholding. Speech recognition results show that the TMT algorithm provides superior recognition accuracy in reverberant environments compared to other algorithms such as LTLSS, VTS, and SSF.
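The masking mechanism described above can be sketched as an online per-channel peak tracker combined with a dynamic threshold that attenuates late-arriving (reverberant) energy. The sketch below is illustrative only: the function name and the parameter values (forgetting factor, suppression factor) are assumptions made for this example, not the exact update rule or constants of the TMT algorithm.

```python
import numpy as np

def temporal_mask(power, lam=0.85, mu=0.2):
    """Illustrative temporal masking via peak detection and dynamic
    thresholding (hypothetical sketch; parameters are not from the paper).

    power : (frames, channels) array of non-negative frame powers.
    lam   : forgetting factor of the decaying peak tracker (assumed).
    mu    : suppression factor applied below the threshold (assumed).
    """
    out = np.empty_like(power)
    peak = np.zeros(power.shape[1])
    for m in range(power.shape[0]):
        p = power[m]
        # Decaying peak per channel: direct-sound onsets reset the peak.
        peak = np.maximum(lam * peak, p)
        # Keep components near the running peak; suppress the rest,
        # which tend to be reverberant tails below the dynamic threshold.
        out[m] = np.where(p >= lam * peak, p, mu * p)
    return out
```

In this sketch an onset frame sets the peak and passes through unchanged, while subsequent frames whose power falls below a fraction of the decaying peak are attenuated by `mu`, mimicking forward masking of the falling power envelope.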

[1] Richard M. Stern, et al. A vector Taylor series approach for environment-independent speech recognition. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996.

[2] Richard M. Stern, et al. Two-microphone source separation algorithm based on statistical modeling of angle distributions. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[3] John H. L. Hansen, et al. A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition. Speech Communication, 2008.

[4] Richard M. Stern, et al. Power function-based power distribution normalization algorithm for robust speech recognition. 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009.

[5] Richard M. Stern, et al. Robust speech recognition using a Small Power Boosting algorithm. 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009.

[6] Vincent Vanhoucke, et al. Improving the speed of neural networks on CPUs. 2011.

[7] Richard M. Stern, et al. Physiologically-motivated synchrony-based processing for robust automatic speech recognition. INTERSPEECH, 2006.

[8] Tara N. Sainath, et al. Fundamental technologies in modern speech recognition. 2012. doi:10.1109/MSP.2012.2205597.

[9] Hyung-Min Park, et al. Non-stationary sound source localization based on zero crossings with the detection of onset intervals. IEICE Electronics Express, 2008.

[10] Richard M. Stern, et al. Automatic selection of thresholds for signal separation algorithms based on interaural delay. INTERSPEECH, 2010.

[11] Richard M. Stern, et al. Delta-spectral cepstral coefficients for robust speech recognition. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[12] Richard M. Stern, et al. Signal processing for robust speech recognition. HLT, 1994.

[13] Richard M. Stern, et al. Power-Normalized Cepstral Coefficients (PNCC) for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[14] Chanwoo Kim, et al. Robust DTW-based recognition algorithm for hand-held consumer devices. IEEE Transactions on Consumer Electronics, 2005.

[15] Richard M. Stern, et al. Binaural sound source separation motivated by auditory processing. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[16] Keith D. Martin. Echo suppression in a computational model of the precedence effect. Proceedings of the 1997 Workshop on Applications of Signal Processing to Audio and Acoustics, 1997.

[17] Patrick M. Zurek. The Precedence Effect. 1987.

[18] R. Plomp, et al. Effect of reducing slow temporal modulations on speech reception. The Journal of the Acoustical Society of America, 1994.

[19] Richard M. Stern, et al. Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010.

[20] Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition. 2012.

[21] Jont B. Allen, et al. Image method for efficiently simulating small-room acoustics. 1976.

[22] Richard M. Stern, et al. Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain. INTERSPEECH, 2009.

[23] Wonyong Sung, et al. A robust formant extraction algorithm combining spectral peak picking and root polishing. EURASIP Journal on Advances in Signal Processing, 2006.

[24] Hyung-Min Park, et al. Binaural and multiple-microphone signal processing motivated by auditory perception. 2008 Hands-Free Speech Communication and Microphone Arrays, 2008.

[25] Richard M. Stern, et al. Nonlinear enhancement of onset for robust speech recognition. INTERSPEECH, 2010.