A single microphone noise reduction algorithm based on the detection and reconstruction of spectro-temporal features

Many animals excel at extracting individual sounds from competing background sounds, yet current state-of-the-art signal processing algorithms struggle to process speech in the presence of even modest background noise. Recent psychophysical experiments in humans and electrophysiological recordings in animal models suggest that the brain is adapted to process sounds within the restricted domain of spectro-temporal modulations found in natural sounds. Here, we describe a novel single-microphone noise reduction algorithm called spectro-temporal detection–reconstruction (STDR) that relies on an artificial neural network trained to detect, extract and reconstruct the spectro-temporal features found in speech. STDR can significantly reduce the level of the background noise while preserving the foreground speech quality and improving estimates of speech intelligibility. In addition, by leveraging the strong temporal correlations present in speech, the STDR algorithm can also operate on predictions of upcoming speech features, retaining similar performance levels while minimizing inherent throughput delays. STDR performs better than a competing state-of-the-art algorithm over a wide range of signal-to-noise ratios and has the potential for real-time applications such as hearing aids and automatic speech recognition.
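To make the notion of evaluating over a range of signal-to-noise ratios concrete, the following minimal sketch mixes a foreground signal with background noise at a chosen SNR in decibels. It is an illustration only and is not taken from the STDR implementation: the function names `mix_at_snr` and `measure_snr_db` are hypothetical, and a sine tone stands in for speech.

```python
import math
import random


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, target_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `target_db`.

    Returns the noisy mixture and the scaled noise (hypothetical helper).
    """
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def measure_snr_db(speech, noise):
    """SNR in dB between a clean signal and an additive noise signal."""
    return 10 * math.log10(power(speech) / power(noise))


# Assumption: a 220 Hz tone sampled at 16 kHz stands in for one second of speech.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy, scaled_noise = mix_at_snr(speech, noise, 5.0)
achieved = measure_snr_db(speech, scaled_noise)
print(f"achieved SNR: {achieved:.2f} dB")
```

Sweeping `target_db` over several values would give the kind of test set a noise reduction algorithm is typically scored on, using whichever quality and intelligibility measures the evaluation calls for.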
