DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech

We propose several enhancement techniques to improve speech quality in bandwidth expansion (BWE) from narrowband to wideband speech. They address three issues that can be critical in real-world applications: (1) discontinuity between the narrowband spectrum and the estimated high-frequency spectrum, (2) energy mismatch between testing and training utterances, and (3) bandwidth expansion of out-of-domain speech signals. Because bandwidth-expanded speech inherently carries a prediction of the missing high-frequency features, we also explore the feasibility of adding these estimated features to those extracted from narrowband speech in order to improve automatic speech recognition (ASR) of narrowband speech. Building on a recently proposed deep neural network based speech BWE system intended for hearing quality enhancement, these techniques not only improve on the traditionally adopted objective and subjective measures but also reduce the word error rate (WER) from 8.67% when recognizing narrowband speech to 8.26% when recognizing bandwidth-expanded speech, approaching the WER of 8.12% obtained when recognizing wideband speech in the 20,000-word open-vocabulary Wall Street Journal ASR task.
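To put the reported WER figures in perspective, the short sketch below computes two derived quantities from the three numbers quoted above: the relative WER reduction over the narrowband baseline, and the share of the narrowband-to-wideband gap that bandwidth expansion recovers. The function names are illustrative and not taken from the paper; only the three WER values come from the text.

```python
# WER figures quoted in the abstract, in percent.
WER_NARROWBAND = 8.67  # recognizing narrowband speech
WER_EXPANDED = 8.26    # recognizing bandwidth-expanded speech
WER_WIDEBAND = 8.12    # recognizing true wideband speech


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def gap_closed(baseline: float, improved: float, reference: float) -> float:
    """Share of the baseline-to-reference WER gap that is recovered, in percent."""
    return 100.0 * (baseline - improved) / (baseline - reference)


if __name__ == "__main__":
    rel = relative_reduction(WER_NARROWBAND, WER_EXPANDED)
    gap = gap_closed(WER_NARROWBAND, WER_EXPANDED, WER_WIDEBAND)
    print(f"relative WER reduction: {rel:.1f}%")      # about 4.7%
    print(f"narrowband-wideband gap closed: {gap:.1f}%")  # about 74.5%
```

Under these figures, the absolute improvement of 0.41 points corresponds to roughly a 4.7% relative reduction and recovers about three quarters of the 0.55-point gap between narrowband and wideband recognition.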

[18]  Gautham J. Mysore,et al.  Language informed bandwidth expansion , 2012, 2012 IEEE International Workshop on Machine Learning for Signal Processing.

[19]  Juan Manuel Górriz,et al.  Voice Activity Detection. Fundamentals and Speech Recognition System Robustness , 2007 .

[20]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[21]  Chin-Hui Lee,et al.  A deep neural network approach to speech bandwidth expansion , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Mei-Yuh Hwang,et al.  Shared-distribution hidden Markov models for speech recognition , 1993, IEEE Trans. Speech Audio Process..

[23]  Jacob Benesty,et al.  Spectral Enhancement Methods , 2009 .

[24]  L. R. Rabiner,et al.  An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition , 1983, The Bell System Technical Journal.

[25]  Kuldip K. Paliwal,et al.  Automatic Speech and Speaker Recognition: Advanced Topics , 1999 .

[26]  Frank K. Soong,et al.  A maximum a Posterior-based reconstruction approach to speech bandwidth expansion in noise , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Yoshihisa Nakatoh,et al.  Generation of broadband speech from narrowband speech based on linear mapping , 2002 .

[28]  Ohad Shamir,et al.  Optimal Distributed Online Prediction , 2011, ICML.

[29]  Biing-Hwang Juang,et al.  Recurrent deep neural networks for robust speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Olli Viikki,et al.  Cepstral domain segmental feature vector normalization for noise robust speech recognition , 1998, Speech Commun..

[31]  Julien Epps,et al.  A new technique for wideband enhancement of coded narrowband speech , 1999, 1999 IEEE Workshop on Speech Coding Proceedings. Model, Coders, and Error Criteria (Cat. No.99EX351).

[32]  Gerhard Schmidt,et al.  Bandwidth Extension of Telephony Speech , 2008 .

[33]  Geoffrey E. Hinton,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[34]  Vladimir Pavlovic,et al.  Toward multimodal human-computer interface , 1998, Proc. IEEE.

[35]  Hermann Ney,et al.  Computing Mel-frequency cepstral coefficients on the power spectrum , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[36]  Geoffrey E. Hinton A Practical Guide to Training Restricted Boltzmann Machines , 2012, Neural Networks: Tricks of the Trade.

[37]  D. M. Allen Mean Square Error of Prediction as a Criterion for Selecting Variables , 1971 .

[38]  Jun Du,et al.  An Experimental Study on Speech Enhancement Based on Deep Neural Networks , 2014, IEEE Signal Processing Letters.