Reconstruction techniques for improving the perceptual quality of binary masked speech.

This study proposes an approach to improving the perceptual quality of speech separated by binary masking through reconstruction in the time-frequency domain. Two reconstruction methods are investigated, non-negative matrix factorization and sparse reconstruction, both of which represent a signal as a linear combination of basis vectors. In the proposed approach, the short-time Fourier transform (STFT) of the separated speech is represented as a linear combination of STFTs drawn from a clean speech dictionary. Binary masking for separation is performed using deep neural networks or Bayesian classifiers. Performance is evaluated with the perceptual evaluation of speech quality (PESQ), a standard objective measure of speech quality. The results show that the proposed techniques improve the perceptual quality of binary masked speech and outperform traditional time-frequency reconstruction approaches.
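
The dictionary-based reconstruction idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes magnitude spectrograms, a fixed non-negative clean-speech dictionary W, and the standard multiplicative update for the Euclidean NMF objective to estimate the activations H. The function names (nmf_activations, reconstruct_masked), the iteration count, and the small constant eps are hypothetical choices made for illustration only.

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 that reduces ||V - W @ H||_F while W stays fixed.

    V: (n_freq, n_frames) non-negative magnitude spectrogram (assumed input).
    W: (n_freq, n_atoms) non-negative clean-speech dictionary (assumed given).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps  # positive initial guess
    WtV = W.T @ V   # constant across iterations
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update: keeps H non-negative at every step.
        H *= WtV / (WtW @ H + eps)
    return H

def reconstruct_masked(V_masked, W, n_iter=200):
    """Approximate a binary-masked spectrogram as a combination of dictionary atoms."""
    H = nmf_activations(V_masked, W, n_iter=n_iter)
    return W @ H

if __name__ == "__main__":
    # Synthetic stand-ins for real data (assumption: no audio is loaded here).
    rng = np.random.default_rng(1)
    W = rng.random((64, 20))                 # hypothetical dictionary
    V_clean = W @ rng.random((20, 30))       # hypothetical clean spectrogram
    mask = rng.random(V_clean.shape) < 0.7   # hypothetical binary mask
    V_hat = reconstruct_masked(mask * V_clean, W)
    print(V_hat.shape)
```

In this sketch the reconstruction is fitted to every time-frequency unit, including those zeroed by the mask. Fitting only the units the mask retains is another possible design, and the abstract does not say which variant the study uses.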
