Time-Domain Multi-Modal Bone/Air Conducted Speech Enhancement

Previous studies have proven that integrating video signals, as a complementary modality, can facilitate improved performance for speech enhancement (SE). However, video clips usually contain large amounts of data and pose a high cost in terms of computational resources and thus may complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while manifesting speech-phoneme structures, and thus complements its air-conducted counterpart. In this study, we propose a novel multi-modal SE structure in the time domain that leverages bone- and air-conducted signals. In addition, we examine two ensemble-learning-based strategies, early fusion (EF) and late fusion (LF), to integrate the two types of speech signals, and adopt a deep learning-based fully convolutional network to conduct the enhancement. The experiment results on the Mandarin corpus indicate that this newly presented multi-modal (integrating bone- and air-conducted signals) SE structure significantly outperforms the single-source SE counterparts (with a bone- or air-conducted signal only) in various speech evaluation metrics. In addition, the adoption of an LF strategy other than an EF in this novel SE multi-modal structure achieves better results.

[1]  DeLiang Wang,et al.  A New Framework for CNN-Based Speech Enhancement in the Time Domain , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[2]  DeLiang Wang,et al.  Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises. , 2016, The Journal of the Acoustical Society of America.

[3]  DeLiang Wang,et al.  Speech segregation based on sound localization , 2001, IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222).

[4]  Shih-Hau Fang,et al.  Speaker-Aware Deep Denoising Autoencoder with Embedded Speaker Identity for Speech Enhancement , 2019, INTERSPEECH.

[5]  Tetsuji Ogawa,et al.  Multi-Channel Speech Enhancement Using Time-Domain Convolutional Denoising Autoencoder , 2019, INTERSPEECH.

[6]  Xiongwei Zhang,et al.  A Speech enhancement scheme based on bone-conducted speech , 2018, ICMSSP '18.

[7]  Hong-Goo Kang,et al.  Survey of Speech Enhancement Supported by a Bone Conduction Microphone , 2012, ITG Conference on Speech Communication.

[8]  Shmuel Peleg,et al.  Visual Speech Enhancement , 2017, INTERSPEECH.

[9]  John H. L. Hansen,et al.  Speech Enhancement Based on Generalized Minimum Mean Square Error Estimators and Masking Properties of the Auditory System , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Yu Tsao,et al.  Speech enhancement based on deep denoising autoencoder , 2013, INTERSPEECH.

[11]  平山亮 会議報告-Speechreading by Humans and Machines; Models Systems and Applications , 1997 .

[12]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Jundong Liu,et al.  Dilated FCN: Listening Longer to Hear Better , 2019, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[14]  Margaret Lech,et al.  Evaluating deep learning architectures for Speech Emotion Recognition , 2017, Neural Networks.

[15]  Zheng-Hua Tan,et al.  Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification , 2017, INTERSPEECH.

[16]  Masakiyo Fujimoto,et al.  Exploring multi-channel features for denoising-autoencoder-based speech enhancement , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Pascal Scalart,et al.  Speech enhancement based on a priori signal to noise estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[18]  Rick Siow Mong Goh,et al.  Multi-Modal Hybrid Deep Neural Network for Speech Enhancement , 2016, ArXiv.

[19]  DeLiang Wang,et al.  A New Framework for Supervised Speech Enhancement in the Time Domain , 2018, INTERSPEECH.

[20]  Bhaskar D. Rao,et al.  Bone-Conduction Sensor Assisted Noise Estimation for Improved Speech Enhancement , 2018, INTERSPEECH.

[21]  Israel Cohen,et al.  Multisensory speech enhancement in noisy environments using bone-conducted and air-conducted microphones , 2014, 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP).

[22]  Hirokazu Kameoka,et al.  A noise suppression method for body-conducted soft speech based on non-negative tensor factorization of air- and body-conducted signals , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Jesper Jensen,et al.  Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24]  DeLiang Wang,et al.  Real-time Speech Enhancement Using an Efficient Convolutional Recurrent Network for Dual-microphone Mobile Phones in Close-talk Scenarios , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Yifan Gong,et al.  Robust automatic speech recognition : a bridge to practical application , 2015 .

[26]  Yu Tsao,et al.  Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks , 2017, IEEE Transactions on Emerging Topics in Computational Intelligence.

[27]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[28]  Chin-Hui Lee,et al.  Tensor-To-Vector Regression for Multi-Channel Speech Enhancement Based on Tensor-Train Network , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Yu Tsao,et al.  Bone-conducted speech enhancement using deep denoising autoencoder , 2018, Speech Commun..

[30]  Alex Bateman,et al.  An introduction to hidden Markov models. , 2007, Current protocols in bioinformatics.

[31]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  DeLiang Wang,et al.  TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[34]  Tetsuya Shimamura,et al.  Quality improvement of bone-conducted speech , 2005, Proceedings of the 2005 European Conference on Circuit Theory and Design, 2005..

[35]  Chih-Hao Fang,et al.  台灣地區噪音下漢語語音聽辨測驗之軟體發展;Software Development of Taiwan Mandarin Hearing In Noise Test , 2018 .

[36]  Jonathan Le Roux,et al.  MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[37]  Jesper Jensen,et al.  An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[38]  Yu Tsao,et al.  Raw waveform-based speech enhancement by fully convolutional networks , 2017, 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[39]  Jinwei Sun,et al.  A wearable bone-conducted speech enhancement system for strong background noises , 2017, 2017 18th International Conference on Electronic Packaging Technology (ICEPT).

[40]  Zicheng Liu,et al.  Multi-sensory microphones for robust speech detection, enhancement and recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[41]  Geoffrey Zweig,et al.  Recent advances in deep learning for speech research at Microsoft , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[42]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[43]  Yu Tsao,et al.  A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation , 2017, IEEE Transactions on Biomedical Engineering.

[44]  Yu Tsao,et al.  End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[45]  John R. Hershey,et al.  Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks , 2015, INTERSPEECH.

[46]  Xuedong Huang,et al.  Air- and bone-conductive integrated microphones for robust speech detection and enhancement , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[47]  DeLiang Wang,et al.  Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[48]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[49]  DeLiang Wang,et al.  Speech segregation based on pitch tracking and amplitude modulation , 2001, Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575).

[50]  Mark D. Plumbley,et al.  Raw Multi-Channel Audio Source Separation using Multi- Resolution Convolutional Auto-Encoders , 2018, 2018 26th European Signal Processing Conference (EUSIPCO).

[51]  D. Stork,et al.  Speechreading by Man and Machine: Models, Systems, and Applications , 1996 .

[52]  Ya Zhang,et al.  Deep feature for text-dependent speaker verification , 2015, Speech Commun..

[53]  Kenji Kimura,et al.  A Study on Restoration of Bone-Conducted Speech with MTF-Based and LP-Based Models (Special Issue on Nonlinear Circuits and Signal Processing) , 2006 .

[54]  T. Shimamura,et al.  Improving Bone-Conducted Speech Quality via Neural Network , 2006, 2006 IEEE International Symposium on Signal Processing and Information Technology.

[55]  Tetsuya Shimamura,et al.  Multisensory speech enhancement using lower‐frequency components from bone‐conducted speech , 2019 .