An experimental study on joint modeling of mixed-bandwidth data via deep neural networks for robust speech recognition

We propose joint modeling strategies that leverage large-scale mixed-band training speech to recognize both narrowband and wideband data with deep neural networks (DNNs). We use conventional down-sampling and up-sampling schemes to convert between narrowband and wideband data. We also explore DNN-based speech bandwidth expansion (BWE) to map some acoustic features from narrowband to wideband speech. By arranging narrowband and wideband features at the input or the output level of the BWE-DNN, and by combining down-sampled and up-sampled data, different DNNs can be established. Our experiments on a Mandarin speech recognition task show that the hybrid DNNs for joint modeling of mixed-band speech yield significant performance gains over narrowband and wideband speech models that are each well trained separately, with relative character error rate reductions of 7.9% and 3.9% on narrowband and wideband data, respectively. Furthermore, the proposed strategies consistently outperform other conventional DNN-based methods.
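The conventional resampling route mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the 2:1 ratio assumes the wideband rate is twice the narrowband rate (the abstract does not state the sample rates), and the pair-averaging and midpoint filters are deliberately crude stand-ins for a proper resampler. It is not the authors' implementation.

```python
# Illustrative sketch only: hypothetical helpers, crude filters, assumed 2:1 rate ratio.

def downsample_by_2(wideband):
    """Crude 2:1 decimation: average each pair of samples (a simple
    low-pass step) and keep one value per pair."""
    return [(wideband[i] + wideband[i + 1]) / 2.0
            for i in range(0, len(wideband) - 1, 2)]


def upsample_by_2(narrowband):
    """Crude 1:2 interpolation: keep each sample and insert the midpoint
    to the next one (the last sample is simply repeated)."""
    out = []
    for i, x in enumerate(narrowband):
        nxt = narrowband[i + 1] if i + 1 < len(narrowband) else x
        out.append(x)
        out.append((x + nxt) / 2.0)
    return out


def pool_at_rate(wideband_utts, narrowband_utts, target):
    """Bring all utterances to one sampling rate so they can share a model.

    target="narrowband": down-sample the wideband utterances.
    target="wideband":   up-sample the narrowband utterances.
    """
    if target == "narrowband":
        return [downsample_by_2(u) for u in wideband_utts] + list(narrowband_utts)
    if target == "wideband":
        return list(wideband_utts) + [upsample_by_2(u) for u in narrowband_utts]
    raise ValueError("target must be 'narrowband' or 'wideband'")


if __name__ == "__main__":
    wb = [[0.0, 2.0, 4.0, 6.0]]   # toy "wideband" utterance
    nb = [[1.0, 5.0]]             # toy "narrowband" utterance
    print(pool_at_rate(wb, nb, "narrowband"))
    print(pool_at_rate(wb, nb, "wideband"))
```

Up-sampling of this kind only changes the sampling rate. It does not restore the high-frequency content missing from narrowband speech, which is the gap that the DNN-based BWE mapping is meant to address.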
