Music Removal by Denoising Autoencoder in Speech Recognition

Correspondence: wangdong99@cslt.riit.tsinghua.edu.cn
CSLT, RIIT, Tsinghua University, 100084 Beijing, China
Full list of author information is available at the end of the article

Abstract

Music embedding often causes significant performance degradation in automatic speech recognition (ASR). This paper proposes a music-removal method based on the denoising autoencoder (DAE), which learns and removes music from music-embedded speech signals. Our study shows that the DAE model can learn patterns of music in different genres, and that DAE-based music removal offers significant performance improvement for ASR. Furthermore, involving convolutional feature extraction offers additional performance gains. Finally, we demonstrate that the music-removal DAE is largely language independent: a model trained with data in one language can be applied to remove music from speech in another language, and models trained with multilingual data may lead to better performance.
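
As a rough illustration of the DAE idea described in the abstract, the sketch below trains a feed-forward network to map a window of music-corrupted feature frames to the corresponding clean frame with a mean-squared-error loss. It assumes PyTorch, 40-dimensional log filterbank features, an 11-frame input context, and illustrative layer sizes; these choices are assumptions for the example, not the configuration used in the paper, whose stronger model also involves convolutional feature extraction.

# Minimal sketch of a feed-forward denoising autoencoder (DAE) for music
# removal. Feature type, context width, layer sizes, and training details
# are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

FEAT_DIM = 40        # per-frame feature dimension (assumed log-Fbank size)
CONTEXT = 5          # frames of left/right context fed to the DAE
IN_DIM = FEAT_DIM * (2 * CONTEXT + 1)

class MusicRemovalDAE(nn.Module):
    """Maps a window of music-corrupted frames to one clean centre frame."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, FEAT_DIM),
        )

    def forward(self, noisy_window):
        return self.net(noisy_window)

def train_step(model, optimizer, noisy_window, clean_frame):
    """One MSE training step: music-corrupted input -> clean speech target."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy_window), clean_frame)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = MusicRemovalDAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Dummy batch standing in for (speech + music, clean speech) feature pairs.
    noisy = torch.randn(32, IN_DIM)
    clean = torch.randn(32, FEAT_DIM)
    print(train_step(model, optimizer, noisy, clean))

In use, the enhanced frames produced by such a network would replace the corrupted features at the input of the ASR acoustic model; a convolutional variant would presumably swap the first fully connected layer for convolutional feature extraction while keeping the same reconstruction objective.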
