A long, deep and wide artificial neural net for robust speech recognition in unknown noise

A long, deep and wide artificial neural net (LDWNN), built from multiple ensemble neural nets that each serve an individual frequency subband, is proposed for robust speech recognition in unknown noise. It is assumed that the effect of arbitrary additive noise on speech recognition can be approximated by white noise (or speech-shaped noise) of similar level across multiple frequency subbands. Within each subband, the ensemble neural nets are trained on clean speech and on speech in speech-shaped noise at 20, 10, and 5 dB SNR to accommodate noise of different levels, and a further neural net is trained to select the most suitable ensemble member for optimum information extraction. The posteriors from the frequency subbands are then fused by another neural net to give a more reliable estimate. Experimental results show that the subband ensemble net adapts well to unknown noise.
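The data flow described above can be illustrated with a minimal sketch, which is not the authors' implementation. It shows how noise might be scaled to a target SNR for training data, and how per-subband selection and cross-subband fusion fit together in terms of array shapes. All names (`mix_at_snr`, `select_expert`, `fuse_subbands`), the toy dimensions, the hard per-frame argmax selection, and the single random linear layer standing in for the fusion net are assumptions made for illustration; random arrays take the place of trained network outputs.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def select_expert(expert_post, selector_scores):
    """Pick, per frame, the posterior of the highest-scoring ensemble member.

    expert_post:     (n_experts, n_frames, n_classes)
    selector_scores: (n_frames, n_experts)
    returns:         (n_frames, n_classes)
    """
    chosen = selector_scores.argmax(axis=1)
    frames = np.arange(expert_post.shape[1])
    return expert_post[chosen, frames]


def fuse_subbands(subband_post, weights):
    """Placeholder fusion: one linear layer plus softmax over concatenated posteriors.

    subband_post: list of (n_frames, n_classes) arrays, one per subband
    weights:      (n_subbands * n_classes, n_classes)
    """
    stacked = np.concatenate(subband_post, axis=1)
    return softmax(stacked @ weights, axis=1)


rng = np.random.default_rng(0)

# Training-data side: mix a stand-in "speech" signal with noise at 10 dB SNR.
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)
achieved_snr = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))

# Recognition side: toy sizes; 4 ensemble members = clean, 20, 10, and 5 dB.
n_subbands, n_experts, n_frames, n_classes = 4, 4, 5, 3
subband_post = []
for _ in range(n_subbands):
    expert_post = softmax(rng.standard_normal((n_experts, n_frames, n_classes)))
    selector_scores = rng.standard_normal((n_frames, n_experts))
    subband_post.append(select_expert(expert_post, selector_scores))

fusion_weights = rng.standard_normal((n_subbands * n_classes, n_classes))
fused = fuse_subbands(subband_post, fusion_weights)
```

In this sketch `fused` holds one posterior distribution per frame. In a trained system the random arrays would be replaced by the outputs of the ensemble, selector, and fusion networks, whose internal structure the abstract does not specify.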
