Data Augmentation Using Multi-Input Multi-Output Source Separation for Deep Neural Network Based Acoustic Modeling

We investigate the use of local Gaussian modeling (LGM) based source separation to improve speech recognition accuracy. Previous studies have shown that LGM-based source separation can be applied successfully both to runtime speech enhancement and to the enhancement of training data for deep neural network (DNN) based acoustic modeling. In this paper, we propose a data augmentation method that exploits the multi-input multi-output (MIMO) characteristic of LGM-based source separation: the separator yields a multi-channel estimate of each source, so every output channel can serve as an additional training example. We first compare unprocessed multi-microphone signals against the multi-channel output signals of LGM-based source separation as augmented training data for DNN-based acoustic modeling. Experimental results on the third CHiME challenge dataset show that the proposed data augmentation outperforms the conventional data augmentation. In addition, we apply beamforming to the source-separated signals as runtime speech enhancement; the results show that this beamforming further improves speech recognition accuracy.
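The MIMO property the abstract relies on is easiest to see in code. The paper itself includes no implementation, so the following is a minimal NumPy sketch of full-rank spatial covariance LGM separation in the style of Duong, Vincent, and Gribonval: an EM loop that alternates multichannel Wiener filtering (E-step) with updates of the spectral variances v_j(f,t) and spatial covariances R_j(f) (M-step), and returns an M-channel image per source. The function name lgm_separate, the random initialisation, and all hyper-parameters are illustrative assumptions, not the authors' implementation.

import numpy as np

def lgm_separate(X, n_src=2, n_iter=20, eps=1e-10):
    """Full-rank spatial covariance LGM separation of a multichannel STFT.

    X : complex ndarray, shape (F, T, M) -- mixture STFT, M microphones.
    Returns S, shape (n_src, F, T, M): one M-channel image per source,
    i.e. a MIMO output whose individual channels can each be used as an
    augmented training signal. Illustrative sketch, not the paper's code.
    """
    F, T, M = X.shape
    rng = np.random.default_rng(0)
    # Illustrative initialisation; in practice it would come from, e.g.,
    # a prior enhancement stage or time-frequency clustering.
    v = rng.uniform(0.1, 1.0, size=(n_src, F, T))            # v_j(f, t)
    R = np.tile(np.eye(M, dtype=complex), (n_src, F, 1, 1))  # R_j(f)
    jitter = eps * np.eye(M)

    for _ in range(n_iter):
        # Per-source covariance v_j(f,t) R_j(f) and inverse mixture covariance.
        C = v[..., None, None] * R[:, :, None]          # (J, F, T, M, M)
        Cx_inv = np.linalg.inv(C.sum(axis=0) + jitter)  # (F, T, M, M)
        S = np.empty((n_src, F, T, M), dtype=complex)
        for j in range(n_src):
            # E-step: multichannel Wiener filter, posterior mean and
            # posterior second moment of the source image.
            W = C[j] @ Cx_inv
            S[j] = np.einsum('ftmn,ftn->ftm', W, X)
            P = (np.einsum('ftm,ftn->ftmn', S[j], S[j].conj())
                 + C[j] - W @ C[j])
            # M-step: v_j(f,t) = tr(R_j^{-1} P) / M, then R_j(f) = mean_t P / v_j.
            R_inv = np.linalg.inv(R[j] + jitter)
            v[j] = np.maximum(np.einsum('fmn,ftnm->ft', R_inv, P).real / M, eps)
            R[j] = (P / v[j][..., None, None]).mean(axis=1)
    return S

On a CHiME-3 style six-channel mixture with n_src=2 (speech plus noise), S[0] would be a six-channel speech image: feeding each of its channels to acoustic-model training realises the kind of augmentation the abstract describes, and the same multi-channel output is what a runtime beamformer can then operate on.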
