Paper information - Time-Domain Multi-Modal Bone/Air Conducted Speech Enhancement

Time-Domain Multi-Modal Bone/Air Conducted Speech Enhancement

Previous studies have shown that integrating video signals as a complementary modality can improve speech enhancement (SE) performance. However, video clips usually contain large amounts of data and are computationally expensive to process, which can complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while still manifesting speech-phoneme structures, and thus complements its air-conducted counterpart. In this study, we propose a novel multi-modal SE structure in the time domain that leverages bone- and air-conducted signals. In addition, we examine two ensemble-learning-based strategies, early fusion (EF) and late fusion (LF), to integrate the two types of speech signals, and adopt a deep learning-based fully convolutional network to conduct the enhancement. Experimental results on a Mandarin corpus indicate that the proposed multi-modal SE structure (integrating bone- and air-conducted signals) significantly outperforms its single-source SE counterparts (with a bone- or air-conducted signal only) on various speech evaluation metrics. In addition, adopting the LF strategy rather than the EF strategy in this multi-modal SE structure achieves better results.
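The difference between the early-fusion and late-fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the layer sizes, kernel width, tanh activation, random weights, and the helper names `conv1d`, `fcn`, `init_weights`, `early_fusion`, and `late_fusion` are all assumptions made for illustration. It only shows where the two waveforms are combined: at the input of a single network (EF), or after each modality has passed through its own network (LF).

```python
import numpy as np

def conv1d(x, w):
    """Same-padded 1-D convolution. x: (in_ch, T), w: (out_ch, in_ch, K), K odd."""
    out_ch, in_ch, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    y = np.zeros((out_ch, x.shape[1]))
    for o in range(out_ch):
        for i in range(in_ch):
            # 'valid' correlation over the padded input keeps the length at T
            y[o] += np.correlate(xp[i], w[o, i], mode="valid")
    return y

def fcn(x, weights):
    """Toy fully convolutional stack: tanh after hidden layers, linear output."""
    for w in weights[:-1]:
        x = np.tanh(conv1d(x, w))
    return conv1d(x, weights[-1])

def init_weights(rng, channels, k=5):
    """Random (untrained) kernels; channel counts are hypothetical."""
    return [0.1 * rng.standard_normal((o, i, k))
            for i, o in zip(channels[:-1], channels[1:])]

def early_fusion(air, bone, weights):
    # EF: stack both waveforms as input channels of ONE network
    return fcn(np.stack([air, bone]), weights)[0]

def late_fusion(air, bone, w_air, w_bone, w_fuse):
    # LF: process each modality with its own network, then fuse the outputs
    y_air = fcn(air[None, :], w_air)
    y_bone = fcn(bone[None, :], w_bone)
    return fcn(np.concatenate([y_air, y_bone]), w_fuse)[0]

rng = np.random.default_rng(0)
T = 160                                   # number of waveform samples (arbitrary)
air = rng.standard_normal(T)              # stand-in for a noisy air-conducted signal
bone = rng.standard_normal(T)             # stand-in for a bone-conducted signal

ef_out = early_fusion(air, bone, init_weights(rng, [2, 8, 1]))
lf_out = late_fusion(air, bone,
                     init_weights(rng, [1, 8, 1]),
                     init_weights(rng, [1, 8, 1]),
                     init_weights(rng, [2, 8, 1]))
print(ef_out.shape, lf_out.shape)
```

Both functions map two length-T waveforms to one length-T waveform, so either could sit in the same training loop. In this sketch the only structural difference is whether the modalities share all layers (EF) or only the final fusion layers (LF).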

Yu Tsao | Cheng Yu | Syu-Siang Wang | Jeih-weih Hung | Kuo-Hsuan Hung
