Monaural source separation based on adaptive discriminative criterion in neural networks

Monaural source separation is an important research area that can improve the performance of several real-world applications, such as speech recognition and assisted living systems. Huang et al. proposed deep recurrent neural networks (DRNNs) trained with a discriminative objective function to improve source separation performance. However, the penalty factor in that objective function is selected arbitrarily and empirically. We therefore introduce an approach that calculates the parameter in the discriminative term adaptively from the discrepancy between the target features, so that the penalty factor changes with the inputs to improve separation performance. The proposed method is evaluated with different settings and neural network architectures. In these experiments, the TIMIT corpus is used as the dataset and the signal-to-distortion ratio (SDR) as the performance measure. Compared with the previous approach, our method shows improved robustness and better separation performance.
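To make the role of the penalty factor concrete, the following is a minimal sketch rather than the implementation evaluated here. The function `discriminative_loss` follows the general form of a discriminative objective: a squared-error fit term minus a penalty factor times the squared error between each estimate and the other source's target. The function `adaptive_gamma`, its normalisation, and the default `gamma_max` are hypothetical choices made purely for illustration; the abstract states only that the factor is derived from the discrepancy between target features and does not give the exact rule.

```python
import numpy as np


def discriminative_loss(y1_hat, y2_hat, y1, y2, gamma):
    """Squared-error fit minus gamma times the cross-source error.

    y1_hat, y2_hat: estimated features of the two sources.
    y1, y2: target features of the two sources.
    gamma: penalty factor weighting the discriminative term.
    """
    fit = np.sum((y1_hat - y1) ** 2) + np.sum((y2_hat - y2) ** 2)
    cross = np.sum((y1_hat - y2) ** 2) + np.sum((y2_hat - y1) ** 2)
    return fit - gamma * cross


def adaptive_gamma(y1, y2, gamma_max=0.05, eps=1e-8):
    """Hypothetical adaptive rule (an assumption, not the paper's formula).

    Scales gamma_max by the discrepancy between the two targets,
    normalised by their total energy. For non-negative features such as
    magnitude spectra, the result lies between 0 and gamma_max.
    """
    discrepancy = np.sum((y1 - y2) ** 2)
    energy = np.sum(y1 ** 2) + np.sum(y2 ** 2) + eps
    return gamma_max * discrepancy / energy


# Toy non-negative "spectra" for two sources (illustrative values only).
y1 = np.array([[1.0, 0.0], [0.0, 1.0]])
y2 = np.array([[0.0, 1.0], [1.0, 0.0]])

gamma = adaptive_gamma(y1, y2)
loss = discriminative_loss(y1, y2, y1, y2, gamma)
```

With this illustrative rule, identical targets give a penalty factor of zero, so the discriminative term vanishes when the sources cannot be told apart, while dissimilar targets push the factor toward `gamma_max`.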
