Single channel speech separation in modulation frequency domain based on a novel pitch range estimation method

Computational Auditory Scene Analysis (CASA) has been a focus of recent literature on speech separation from monaural mixtures. The performance of current CASA systems on voiced speech separation depends strongly on the robustness of the algorithm used for pitch frequency estimation. We propose a new system that estimates the pitch range of a target utterance and separates the voiced portions of the target speech. The algorithm first estimates the pitch range of the target speech in each frame of data in the modulation frequency domain, and then uses the estimated pitch range to segregate the target speech. The pitch range estimation is based on an onset and offset algorithm. Speech separation is performed by filtering the mixture signal with a mask extracted from the modulation spectrogram. A systematic evaluation shows that the proposed system extracts the majority of the target speech with minimal interference and outperforms previous systems in both pitch extraction and voiced speech separation.
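The pipeline the abstract describes lends itself to a compact illustration: compute a modulation spectrogram (the envelope of each acoustic-frequency channel, followed by a second FFT along time), measure how much of each channel's modulation energy falls inside the estimated pitch range, and retain only those channels. The Python sketch below assumes numpy/scipy; the function names, the fixed pitch range standing in for the paper's onset/offset-based per-frame estimate, and the 0.5 energy-ratio threshold are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' method) of modulation-domain masking.
# The paper estimates the pitch range per frame with an onset/offset
# algorithm; here a fixed range [f0_lo, f0_hi] stands in for that estimate.
import numpy as np
from scipy.signal import stft, istft

def modulation_spectrogram(x, fs, nperseg=128, noverlap=112):
    """STFT magnitude envelopes, then an FFT of each acoustic-frequency
    channel's envelope along time, giving modulation frequencies."""
    f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    env = np.abs(X)                          # per-channel temporal envelope
    frame_rate = fs / (nperseg - noverlap)   # envelope sample rate (Hz)
    mod = np.fft.rfft(env, axis=1)           # modulation spectrum per channel
    mod_freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / frame_rate)
    return X, mod, mod_freqs

def separate_voiced(x, fs, f0_lo=80.0, f0_hi=300.0, nperseg=128, noverlap=112):
    """Keep acoustic channels whose envelope modulation energy falls mostly
    inside the (assumed) pitch range; zero the rest and resynthesize."""
    X, mod, mod_freqs = modulation_spectrogram(x, fs, nperseg, noverlap)
    band = (mod_freqs >= f0_lo) & (mod_freqs <= f0_hi)
    in_band = np.sum(np.abs(mod[:, band]) ** 2, axis=1)
    total = np.sum(np.abs(mod[:, 1:]) ** 2, axis=1) + 1e-12   # skip DC
    mask = (in_band / total > 0.5).astype(float)[:, None]     # binary mask
    _, y = istft(X * mask, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```

Note that this sketch applies one binary mask per acoustic channel over the whole utterance, whereas the proposed system builds the mask in the modulation spectrogram frame by frame from the estimated pitch range. The analysis hop is kept short (16 samples here, an envelope rate of 1 kHz at 16 kHz sampling) so that pitch-rate envelope modulations remain below the envelope's Nyquist frequency.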
