A source/filter model with adaptive constraints for NMF-based speech separation

This paper introduces a constrained source/filter model for semi-supervised speech separation based on non-negative matrix factorization (NMF). The objective is to inform NMF with prior knowledge about speech, providing a physically meaningful speech separation. To do so, a source/filter model (indicated as Instantaneous Mixture Model or IMM) is integrated in the NMF. Furthermore, constraints are added to the IMM-NMF, in order to control the NMF behaviour during separation, and to enforce its physical meaning. In particular, a speech specific constraint - based on the source/filter coherence of speech - and a method for the automatic adaptation of constraints' weights during separation are presented. Also, the proposed source/filter model is semi-supervised: during training, one filter basis is estimated for each phoneme of a speaker; during separation, the estimated filter bases are then used in the constrained source/filter model. An experimental evaluation for speech separation was conducted on the TIMIT speakers database mixed with various environmental background noises from the QUT-NOISE database. This evaluation showed that the use of adaptive constraints increases the performance of the source/filter model for speaker-dependent speech separation, and compares favorably to fully-supervised speech separation.

[1]  Paris Smaragdis,et al.  A Non-negative Approach to Language Informed Speech Separation , 2012, LVA/ICA.

[2]  Jonathan Le Roux,et al.  Deep NMF for speech separation , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  A. Klapuri,et al.  Analysis of polyphonic audio using source-filter model and non-negative matrix factorization , 2006 .

[4]  Bhiksha Raj,et al.  Active-Set Newton Algorithm for Overcomplete Non-Negative Representations of Audio , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Rémi Gribonval,et al.  Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Gaël Richard,et al.  Main instrument separation from stereophonic audio signals using a source/filter model , 2009, 2009 17th European Signal Processing Conference.

[7]  Antoine Liutkus,et al.  An overview of informed audio source separation , 2013, 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS).

[8]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[9]  Sridha Sridharan,et al.  The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms , 2010, INTERSPEECH.

[10]  Jérôme Idier,et al.  Algorithms for Nonnegative Matrix Factorization with the β-Divergence , 2010, Neural Computation.

[11]  Gautham J. Mysore,et al.  Universal speech models for speaker independent single channel source separation , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[13]  Stan Z. Li,et al.  Learning spatially localized, parts-based representation , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[14]  Tuomas Virtanen,et al.  Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Victor Zue,et al.  Speech database development at MIT: Timit and beyond , 1990, Speech Commun..

[16]  Axel Röbel,et al.  Improving Lpc Spectral Envelope Extraction Of Voiced Speech By True-Envelope Estimation , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[17]  Björn W. Schuller,et al.  A comparative study on sparsity penalties for NMF-based speech separation: Beyond LP-norms , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Rémi Gribonval,et al.  Audio source separation with a single sensor , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Axel Röbel,et al.  On automatic drum transcription using non-negative matrix deconvolution and itakura saito divergence , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Bhiksha Raj,et al.  Active-set newton algorithm for non-negative sparse coding of audio , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Alexey Ozerov,et al.  Text-Informed Audio Source Separation. Example-Based Approach Using Non-Negative Matrix Partial Co-Factorization , 2014, Journal of Signal Processing Systems.

[22]  Gautham J. Mysore,et al.  Speaker and noise independent online single-channel speech enhancement , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).