2D Psychoacoustic modeling of equivalent masking for automatic speech recognition

Noise robustness has long been one of the most important goals in speech recognition. While the performance of automatic speech recognition (ASR) deteriorates in noisy conditions, the human auditory system is relatively adept at handling noise. To mimic this adeptness, we study and apply psychoacoustic models in speech recognition as a means of improving the robustness of ASR systems. Psychoacoustic models are usually implemented in a subtractive manner, with the intention of removing noise. However, this is not the only possible approach. This paper presents a novel algorithm that implements psychoacoustic models additively. The algorithm is motivated by the fact that weak sound elements below the masking threshold are perceptually equivalent to the human auditory system, regardless of their actual sound pressure level. Another important contribution of the proposed algorithm is a more faithful implementation of the masking effect: only those sounds that fall below the masking threshold are modified, which better reflects physical masking. We give detailed experimental results showing the relationships between the subtractive and additive approaches. Since all the parameters of the proposed filters are positive or zero, they are named 2D psychoacoustic P-filters. A detailed theoretical analysis is provided to show the noise removal ability of these filters. Experiments are carried out on the AURORA2 database, and the results show that the proposed feature extraction method effectively increases the word recognition rate. Given models trained on clean speech, the proposed method achieves up to 84.23% word recognition on noisy data.

Highlights

- Modeling of the human auditory system.
- 2D psychoacoustic model based on equivalent masking.
- Comparison of different implementation styles of psychoacoustic models.
- A unified mathematical model combining different conditions.
- Detailed analysis of algorithm performance.
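The contrast between the two implementation styles can be illustrated with a minimal sketch. This is not the paper's actual P-filter implementation; it is a hypothetical toy in which `spectrum` and `masking_threshold` are per-bin magnitude arrays, and it only shows the core idea: subtractive processing discards sub-threshold components, while the additive view raises them to the threshold, exploiting the fact that all sub-threshold levels are perceptually equivalent.

```python
import numpy as np

def subtractive_masking(spectrum, masking_threshold):
    """Conventional subtractive style (illustrative): components below
    the masking threshold are treated as noise and zeroed out."""
    return np.where(spectrum >= masking_threshold, spectrum, 0.0)

def additive_equivalent_masking(spectrum, masking_threshold):
    """Additive style (illustrative): components below the masking
    threshold are inaudible anyway, so they are raised to the threshold
    instead of removed; supra-threshold components are left untouched."""
    return np.maximum(spectrum, masking_threshold)

# Toy per-bin magnitudes and masking thresholds (made-up numbers).
spec = np.array([1.0, 5.0, 0.2, 3.0])
thr = np.array([0.5, 1.0, 0.5, 4.0])

print(subtractive_masking(spec, thr))          # sub-threshold bins removed
print(additive_equivalent_masking(spec, thr))  # sub-threshold bins raised
```

Note that both functions modify only the bins that fall below the threshold, mirroring the paper's point that a faithful masking model should leave audible components unchanged.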
