Dual-channel VTS feature compensation for noise-robust speech recognition on mobile devices

One way to improve automatic speech recognition (ASR) performance on the latest mobile devices, which can be used in a variety of noisy environments, is to take advantage of the small microphone arrays embedded in them. Since classic beamforming techniques perform rather poorly with small microphone arrays, specific techniques are being developed to exploit this feature efficiently for noise-robust ASR. In this study, a novel dual-channel minimum mean square error (MMSE)-based feature compensation method is proposed, relying on a vector Taylor series (VTS) expansion of a dual-channel speech distortion model. In contrast to the single-channel VTS approach (which can be considered the state of the art for feature compensation), the authors' technique benefits in particular from the spatial properties of speech and noise. Their proposal is assessed on a dual-microphone smartphone (a particular case of interest) by means of the AURORA2-2C synthetic corpus. Word recognition results, also validated with real noisy speech data, show that their method is more accurate, clearly outperforming minimum variance distortionless response (MVDR) beamforming and a single-channel VTS feature compensation approach, especially at low signal-to-noise ratios.
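
As background for the VTS idea mentioned above, the following is a minimal sketch of the standard single-channel log-spectral distortion model for additive noise and its first-order Taylor linearization. It is not the authors' dual-channel model, which the abstract does not specify. The function names (`distort`, `vts_linearize`, `vts_approx`) and the additive-noise-only assumption (no channel term) are illustrative choices made here.

```python
import numpy as np


def distort(x, n):
    """Noisy log-spectral feature under additive noise: y = log(exp(x) + exp(n)).

    x and n are clean-speech and noise log-spectral features (assumed model).
    """
    return np.logaddexp(x, n)


def vts_linearize(mu_x, mu_n):
    """First-order Taylor expansion of distort() around (mu_x, mu_n).

    Returns the value at the expansion point and the element-wise
    (diagonal) Jacobians with respect to x and n.
    """
    y0 = distort(mu_x, mu_n)
    jx = 1.0 / (1.0 + np.exp(mu_n - mu_x))  # dy/dx
    jn = 1.0 - jx                           # dy/dn
    return y0, jx, jn


def vts_approx(x, n, mu_x, mu_n):
    """Linearized approximation of distort(x, n) near (mu_x, mu_n)."""
    y0, jx, jn = vts_linearize(mu_x, mu_n)
    return y0 + jx * (x - mu_x) + jn * (n - mu_n)


if __name__ == "__main__":
    mu_x = np.array([2.0, 0.0, -1.0])   # hypothetical clean-speech means
    mu_n = np.array([0.0, 0.0, 1.0])    # hypothetical noise means
    x = mu_x + 0.01
    n = mu_n - 0.01
    print("exact :", distort(x, n))
    print("approx:", vts_approx(x, n, mu_x, mu_n))
```

The linearization makes the noisy feature an affine function of the clean feature and the noise near the expansion point, which is what allows Gaussian statistics of clean speech and noise to be propagated in closed form in VTS-style compensation.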
