Artificial Speech Bandwidth Extension Using Deep Neural Networks for Wideband Spectral Envelope Estimation

Estimating a wideband spectral envelope when only narrowband speech is available is a challenging task. In this paper, we explore ways to do so in the context of an artificial speech bandwidth extension (ABE) framework. Starting from a typical hidden Markov model (HMM)/Gaussian mixture model baseline scheme, we investigate two types of features, topologies, and regularization approaches of deep neural networks (DNNs) to obtain estimates of wideband spectral envelopes with the smallest cepstral distance to the original ones. To draw realistic conclusions, we test on a database that is acoustically different from the training and validation speech material. Interestingly, a DNN regression approach outperforms all other investigated methods, even though the HMM has been dropped. Cepstral distance was reduced by 1.18 dB, wideband PESQ was improved by 0.23 MOS points, and a subjective comparison category rating listening test showed a significant preference for the best DNN ABE approach over narrowband speech of 1.37 CMOS points.
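As a rough illustration of the cepstral distance figure quoted above, the following sketch computes one commonly used form of the measure in dB between a reference and an estimated cepstral vector. The abstract does not state the exact definition used in the paper, so the formula, the exclusion of the zeroth coefficient, and the function name `cepstral_distance_db` are assumptions made here for illustration only.

```python
import math


def cepstral_distance_db(c_ref, c_est):
    """Cepstral distance in dB between two cepstral coefficient vectors.

    Assumption: both vectors hold coefficients c_1..c_K (c_0 excluded),
    and the common form (10 / ln 10) * sqrt(2 * sum((c - c_hat)^2)) is used.
    """
    if len(c_ref) != len(c_est):
        raise ValueError("cepstral vectors must have the same length")
    squared_error = sum((a - b) ** 2 for a, b in zip(c_ref, c_est))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_error)


# Hypothetical per-frame cepstra; a real evaluation would average over frames.
reference = [0.50, -0.20, 0.10]
estimate = [0.45, -0.15, 0.12]
distance = cepstral_distance_db(reference, estimate)
```

In an evaluation of this kind, the per-frame distances would typically be averaged over all test frames, so a lower mean value indicates estimated envelopes that lie closer to the original wideband ones.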
