Speech fragment decoding techniques for simultaneous speaker identification and speech recognition

This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or other of the speech sources. A speech fragment decoder employing missing data techniques and clean speech models simultaneously searches for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task using artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance produced by the interplay between energetic and informational masking. However, at around 0 dB its performance remains poor, and an analysis of the errors shows that a large number of target/masker confusions are made. The paper therefore presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. Combining this component with the recognition system yields significant improvements. When the target and masker utterances have the same gender, the system matches human performance at 0 dB; in other conditions its error rate is roughly twice the human error rate.
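The missing-data idea underlying the fragment decoder is that unreliable (masked) spectral channels can be marginalized out when scoring a frame against clean speech models, so only the channels dominated by the target contribute to the likelihood. A minimal sketch of this marginalization for a diagonal-covariance GMM frame score (all names and the toy parameters here are illustrative, not the paper's implementation; full marginalization is used rather than the bounded variant):

```python
import numpy as np

def missing_data_loglik(x, mask, means, variances, weights):
    """Log-likelihood of one spectral frame under a diagonal GMM,
    marginalizing out channels flagged unreliable (mask == 0).

    Integrating a Gaussian over the full range of a dimension gives 1,
    so only reliable channels contribute to each component's score.
    """
    reliable = np.asarray(mask).astype(bool)
    component_logs = []
    for mu, var, w in zip(means, variances, weights):
        # Diagonal-Gaussian log-density over reliable channels only.
        ll = -0.5 * np.sum(
            np.log(2.0 * np.pi * var[reliable])
            + (x[reliable] - mu[reliable]) ** 2 / var[reliable]
        )
        component_logs.append(np.log(w) + ll)
    # Log-sum-exp over mixture components.
    return np.logaddexp.reduce(component_logs)

# Toy two-component model over two spectral channels.
x = np.array([0.0, 1.0])
means = [np.zeros(2), np.ones(2)]
variances = [np.ones(2), np.ones(2)]
weights = [0.5, 0.5]
print(missing_data_loglik(x, [1, 1], means, variances, weights))  # all reliable
print(missing_data_loglik(x, [1, 0], means, variances, weights))  # channel 1 masked
```

With all channels masked the score reduces to the log of the summed mixture weights (zero for a normalized model), reflecting that a fully masked frame carries no evidence; the fragment decoder exploits this by trying alternative reliable/unreliable labellings induced by different fragment assignments.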
