Cochlea-scaled spectral entropy predicts rate-invariant intelligibility of temporally distorted sentences.

Some evidence, mostly drawn from experiments using only a single moderate rate of speech, suggests that low-frequency amplitude modulations may be particularly important for intelligibility. Here, two experiments investigated intelligibility of temporally distorted sentences across a wide range of simulated speaking rates, and two metrics were used to predict the results. Sentence intelligibility was assessed when successive segments of fixed duration were temporally reversed (exp. 1), and when sentences were processed through four third-octave-band filters, the outputs of which were desynchronized (exp. 2). In both experiments, intelligibility decreased with increasing distortion. However, in exp. 2, intelligibility recovered modestly with longer desynchronization. Across conditions, performance measured as a function of the proportion of the utterance distorted converged to a common function. Estimates of intelligibility derived from modulation transfer functions predicted a substantial proportion of the variance in listeners' responses in exp. 1, but failed to predict performance in exp. 2. By contrast, a second metric, introduced here, quantifies potential information as the relative dissimilarity (change) between successive cochlea-scaled spectra. This metric reliably predicted listeners' intelligibility across the full range of speaking rates in both experiments. The results support an information-theoretic approach to speech perception and point to the significance of spectral change rather than physical units of time.
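The two ideas at the core of the abstract, reversing successive fixed-duration segments and measuring change between successive cochlea-scaled spectra, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the frame length, the number of bands, the pooling of FFT bins into bands equally spaced on the ERB-rate scale (used here in place of true auditory filters), and the Euclidean distance as the dissimilarity measure are all assumptions made for illustration. The Hz-to-ERB-rate conversion follows the standard Glasberg and Moore formula.

```python
import numpy as np


def reverse_segments(signal, segment_len):
    """Reverse each successive segment of `segment_len` samples.

    Segment order is preserved; only the samples inside each segment are
    flipped. A shorter final segment is reversed as well.
    """
    out = signal.copy()
    for start in range(0, len(signal), segment_len):
        out[start:start + segment_len] = signal[start:start + segment_len][::-1]
    return out


def hz_to_erb_rate(f_hz):
    """Glasberg & Moore ERB-rate scale (number of ERBs below f_hz)."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37


def spectral_change(signal, fs, frame_len, n_bands=16, f_min=50.0):
    """Distance between successive ERB-pooled magnitude spectra.

    Assumption: non-overlapping Hann-windowed frames, FFT magnitudes summed
    into `n_bands` bands equally spaced in ERB rate between f_min and fs/2,
    and Euclidean distance between adjacent frames as the change measure.
    Returns one value per pair of adjacent frames.
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    mag = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)

    # Band edges equally spaced on the ERB-rate scale.
    edges_hz = erb_rate_to_hz(
        np.linspace(hz_to_erb_rate(f_min), hz_to_erb_rate(fs / 2.0), n_bands + 1)
    )
    band_idx = np.digitize(freqs, edges_hz) - 1  # -1 means below f_min

    spectra = np.zeros((n_frames, n_bands))
    for b in range(n_bands):
        in_band = band_idx == b
        if in_band.any():
            spectra[:, b] = mag[:, in_band].sum(axis=1)

    # Euclidean distance between each frame and the one that follows it.
    return np.linalg.norm(np.diff(spectra, axis=0), axis=1)
```

Under these assumptions, a steady tone yields near-zero change between frames, while a spectral transition produces a peak at the frame boundary where it occurs. Summing such distances over an utterance would give one rough index of how much spectral change a given distortion leaves intact.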
