Soft harmonic masks for recognising speech in the presence of a competing speaker

This paper addresses the problem of recognising speech in the presence of a competing speaker using a two-stage ‘Speech Fragment Decoding’ system. The system first segments a spectro-temporal representation of the mixture into a number of fragments, such that each fragment is dominated by a single source. The ASR search is then extended to find the combination of speech model sequence and fragment subset that best fits a set of clean speech models. This paper extends previous work by combining ‘Speech Fragment Decoding’ with soft missing data techniques to better handle spectro-temporal regions that cannot be confidently ascribed to either the foreground or the background. Recognition experiments are performed on a connected digit task using 0 dB mixtures of simultaneous mixed-gender speakers. Incorporating soft decisions improves system performance from 66.9% to 72.2%.
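
To make the idea of a soft missing data decision concrete, the following is a minimal sketch of one common formulation, not necessarily the exact one used in the paper. It assumes a single diagonal-covariance Gaussian state, non-negative spectral energy features, and a per-channel soft mask value in [0, 1] giving the confidence that the channel is dominated by the target speaker. All function and parameter names are hypothetical. Each channel's likelihood is a mask-weighted blend of the ordinary "present" likelihood and a bounded-marginal "missing" term, in which the target energy is only assumed to lie somewhere between 0 and the observed value.

```python
import math
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian, evaluated elementwise."""
    z = (x - mean) / std
    return np.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    """Cumulative distribution of a univariate Gaussian, elementwise."""
    z = (np.asarray(x, dtype=float) - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))


def soft_missing_data_loglik(x, mask, mean, std):
    """Log-likelihood of one frame under a soft mask (illustrative only).

    x    : observed spectral energies of the mixture, one per channel
    mask : soft mask in [0, 1]; 1 means "reliably the target speaker"
    mean, std : per-channel parameters of a clean-speech Gaussian
    """
    x = np.asarray(x, dtype=float)
    mask = np.asarray(mask, dtype=float)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)

    # Channel treated as reliable: evaluate the observation directly.
    present = gaussian_pdf(x, mean, std)

    # Channel treated as masked: the target energy is only known to lie
    # in [0, x], so average the density over that interval.
    mass = gaussian_cdf(x, mean, std) - gaussian_cdf(0.0, mean, std)
    bounded = mass / np.maximum(x, 1e-10)

    # Soft decision: interpolate between the two interpretations.
    per_channel = mask * present + (1.0 - mask) * bounded
    return float(np.sum(np.log(np.maximum(per_channel, 1e-300))))
```

With a mask of all ones this reduces to the ordinary Gaussian log-likelihood, and with a mask of all zeros it reduces to bounded marginalisation; intermediate values express uncertainty about which source dominates a channel.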
