Monaural speech segregation using synthetic speech signals.

When listening to natural speech, listeners are fairly adept at using cues such as pitch, vocal tract length, prosody, and level differences to extract a target speech signal from an interfering speech masker. However, little is known about the cues that listeners might use to segregate synthetic speech signals that retain the intelligibility characteristics of speech but lack many of the features that listeners normally use to segregate competing talkers. In this experiment, intelligibility was measured in a diotic listening task that required the segregation of two simultaneously presented synthetic sentences. Three types of synthetic signals were created: (1) sine-wave speech (SWS); (2) modulated noise-band speech (MNB); and (3) modulated sine-band speech (MSB). Listeners performed worse with all three types of synthetic signals than with natural speech signals, particularly at low signal-to-noise ratio (SNR) values. The results indicate that, of the three synthetic signals, SWS preserves more of the voice characteristics used for speech segregation than MNB or MSB signals do. These findings have implications for cochlear implant users, who rely on signals very similar to MNB speech and are thus likely to have difficulty understanding speech in cocktail-party listening environments.
