Bone-conducted speech enhancement using deep denoising autoencoder

Abstract: Bone-conduction microphones (BCMs) capture speech from the vibrations of the speaker's skull and are more resistant to noise than conventional air-conduction microphones (ACMs). Because BCMs capture only the low-frequency portion of speech signals, their frequency response differs considerably from that of ACMs. Replacing an ACM with a BCM may therefore give satisfactory noise suppression, but speech quality and intelligibility may be degraded by the nature of solid-borne vibration. The mismatch between BCM and ACM characteristics can also degrade automatic speech recognition (ASR) performance, and it is infeasible to build a new ASR system from BCM voice data. In this study, we propose a novel deep denoising autoencoder (DDAE) approach that bridges BCM and ACM speech in order to improve speech quality and intelligibility, so that an existing ASR system can be used directly without being rebuilt. Experimental results first demonstrate that the DDAE approach effectively improves speech quality and intelligibility according to standardized evaluation metrics. Moreover, the proposed system improves ASR performance, yielding a 48.28% relative character error rate (CER) reduction (from 14.50% to 7.50%) under quiet conditions. In an actual noisy environment (sound pressure levels from 61.7 dBA to 73.9 dBA), the proposed system with a BCM outperforms an ACM, yielding an 84.46% relative CER reduction (proposed system: 9.13%; ACM: 58.75%).
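The relative CER reductions quoted above can be checked from the absolute CER values. The following is a minimal sketch, assuming the usual definition of relative reduction, (baseline - system) / baseline; the function name `relative_cer_reduction` is illustrative and does not come from the paper.

```python
def relative_cer_reduction(baseline_cer: float, system_cer: float) -> float:
    """Return the relative CER reduction in percent.

    Assumes the common definition (baseline - system) / baseline;
    both arguments are absolute CER values in percent.
    """
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - system_cer) / baseline_cer


# Quiet conditions: CER drops from 14.50% to 7.50%.
quiet = relative_cer_reduction(14.50, 7.50)

# Noisy environment: ACM at 58.75% versus the proposed BCM system at 9.13%.
noisy = relative_cer_reduction(58.75, 9.13)

print(f"quiet: {quiet:.2f}%")  # quiet: 48.28%
print(f"noisy: {noisy:.2f}%")  # noisy: 84.46%
```

Under this definition, both figures reported in the abstract follow directly from the stated absolute CER values.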
