Music removal by convolutional denoising autoencoder in speech recognition

Music embedding often causes significant performance degradation in automatic speech recognition (ASR). This paper proposes a music-removal method based on denoising autoencoder (DAE) that learns and removes music from music-embedded speech signals. Particularly, we focus on convolutional denoising autoencoder (CDAE) that can learn local musical patterns by convolutional feature extraction. Our study shows that the CDAE model can learn patterns of music in different genres and the CDAE-based music removal offers significant performance improvement for ASR. Additionally, we demonstrate that this music-removal approach is largely language independent, which means that a model trained with data in one language can be applied to remove music from speech in another language, and models trained with multilingual data may lead to better performance.

[1]  Dong Yu,et al.  Deep Learning: Methods and Applications , 2014, Found. Trends Signal Process..

[2]  P. Vanroose,et al.  BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION , 2003 .

[3]  Yasuo Horiuchi,et al.  Reverberant speech recognition based on denoising autoencoder , 2013, INTERSPEECH.

[4]  Paris Smaragdis,et al.  Singing-voice separation from monaural recordings using robust principal component analysis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Anssi Klapuri,et al.  Accompaniment separation and karaoke application based on automatic melody transcription , 2008, 2008 IEEE International Conference on Multimedia and Expo.

[6]  Xie Xiang,et al.  NMF based speech and music separation in monaural speech recordings with sparseness and temporal continuity constraints , 2013, ICMT 2013.

[7]  Guillermo Sapiro,et al.  Real-time Online Singing Voice Separation from Monaural Recordings Using Robust Low-rank Modeling , 2012, ISMIR.

[8]  Changshui Zhang,et al.  Separation of Music Signals by Harmonic Structure Modeling , 2005, NIPS.

[9]  Ruijiang Li,et al.  Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Trausti T. Kristjansson,et al.  Music models for music-speech separation , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Shigeki Sagayama,et al.  Singing Voice Enhancement in Monaural Music Signals Based on Two-stage Harmonic/Percussive Sound Separation on Multiple Resolution Spectrograms , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12]  Hynek Hermansky,et al.  Multilingual MLP features for low-resource LVCSR systems , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Bryan Pardo,et al.  REpeating Pattern Extraction Technique (REPET): A Simple Method for Music/Voice Separation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Quoc V. Le,et al.  Recurrent Neural Networks for Noise Reduction in Robust ASR , 2012, INTERSPEECH.

[15]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[16]  Jean-Philippe Thiran,et al.  Musical Audio Source Separation Based on User-Selected F0 Track , 2012, LVA/ICA.

[17]  Andreas Stolcke,et al.  Cross-Domain and Cross-Language Portability of Acoustic Features Estimated by Multilayer Perceptrons , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[18]  Thomas Fang Zheng,et al.  Noisy training for deep neural networks in speech recognition , 2015, EURASIP Journal on Audio, Speech, and Music Processing.

[19]  Ngoc Thang Vu,et al.  Multilingual bottle-neck features and its application for under-resourced languages , 2012, SLTU.

[20]  Kyogu Lee,et al.  Vocal separation using extended robust principal component analysis with Schatten p/lp-norm and scale compression , 2014, 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).