Paper information - An experimental study on joint modeling of mixed-bandwidth data via deep neural networks for robust speech recognition

We propose joint modeling strategies that leverage large-scale mixed-bandwidth training speech to recognize both narrowband and wideband data with deep neural networks (DNNs). Conventional down-sampling and up-sampling are used to convert between narrowband and wideband data. We also explore DNN-based speech bandwidth expansion (BWE), which maps some acoustic features of narrowband speech to those of wideband speech. By placing narrowband and wideband features at either the input or the output of the BWE-DNN, and by combining down-sampled and up-sampled data, different DNNs can be built. Experiments on a Mandarin speech recognition task show that the hybrid DNNs for joint modeling of mixed-band speech yield significant gains over narrowband and wideband models that were each well trained separately, with relative character error rate reductions of 7.9% and 3.9% on narrowband and wideband data, respectively. The proposed strategies also consistently outperform other conventional DNN-based methods.
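The abstract reports its gains as relative character error rate (CER) reductions. As a minimal sketch of how such a figure is usually computed, the snippet below divides the absolute CER drop by the baseline CER. The function name `relative_cer_reduction` and the CER values are hypothetical and chosen only for illustration; the abstract gives only the relative reductions, not the underlying absolute error rates.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative CER reduction as a fraction of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical CERs in percent (assumed values, NOT taken from the paper).
baseline = 20.0   # separately trained model
joint = 18.42     # jointly trained mixed-band model

reduction = relative_cer_reduction(baseline, joint)
print(f"{reduction:.1%}")  # prints "7.9%"
```

Under these assumed numbers, an absolute drop of 1.58 points on a 20.0% baseline corresponds to a 7.9% relative reduction, which shows why a relative figure can look larger than the absolute change behind it.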
