Information-preserving temporal reallocation of speech in the presence of fluctuating maskers

How can speech be retimed so as to maximise its intelligibility in the face of competing speech? We present a general strategy which modifies local speech rate to minimise overlap with a known fluctuating masker. Continuous time-scale factors are derived in an optimisation procedure which seeks to minimise overall energetic masking of the speech by the masker while additionally unmasking those speech regions potentially most important for speech recognition. Intelligibility is evaluated with both objective and subjective measures, and the retimed speech shows significant gains over an unmodified baseline, with larger benefits at lower signal-to-noise ratios. The retiming approach does not lead to benefits for speech mixed with stationary maskers, suggesting that the gains observed for the fluctuating masker are not simply due to durational expansion.
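
The idea of retiming speech against a known masker can be illustrated with a small frame-level sketch. This is not the procedure described in the abstract, which derives continuous time-scale factors; it swaps in a discrete, dynamic-time-warping style search as a stand-in. The function name `retime_path`, the per-frame `weight` values (standing in for how important each speech frame is), the per-frame `masker` energies, and the `step_penalty` that discourages excessive stretching or compression are all assumptions made for illustration.

```python
import math

def retime_path(weight, masker, step_penalty=0.1):
    """Align speech frames to output frames so that important speech
    frames avoid high-energy masker frames (illustrative sketch only).

    weight[i]  : assumed importance of speech frame i
    masker[j]  : assumed masker energy in output frame j
    Returns (path, total_cost), where path is a list of
    (speech_frame, output_frame) pairs from (0, 0) to (n-1, t-1).
    """
    n, t = len(weight), len(masker)
    cost = [[math.inf] * t for _ in range(n)]
    # Back-pointers: 0 = diagonal (unchanged rate),
    # 1 = stretch (same speech frame, next output frame),
    # 2 = compress (next speech frame, same output frame).
    back = [[0] * t for _ in range(n)]
    cost[0][0] = weight[0] * masker[0]

    for i in range(n):
        for j in range(t):
            if i == 0 and j == 0:
                continue
            cands = (
                cost[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
                cost[i][j - 1] + step_penalty if j > 0 else math.inf,
                cost[i - 1][j] + step_penalty if i > 0 else math.inf,
            )
            k = min(range(3), key=cands.__getitem__)
            back[i][j] = k
            # Local cost: weighted masker energy overlapping this frame.
            cost[i][j] = cands[k] + weight[i] * masker[j]

    # Trace the cheapest path back from the final cell.
    i, j = n - 1, t - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        k = back[i][j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            j -= 1
        else:
            i -= 1
        path.append((i, j))
    path.reverse()
    return path, cost[n - 1][t - 1]


# Toy input: important speech frames (weight 1.0) coincide with a masker burst.
weight = [0.1, 0.1, 1.0, 1.0, 0.1, 0.1]
masker = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
path, total = retime_path(weight, masker)
baseline = sum(w * m for w, m in zip(weight, masker))  # unmodified timing
```

In this toy case the search moves the heavily weighted frames out of the masker burst, so `total` comes out lower than `baseline`. With a stationary masker (all `masker` values equal) no retiming can reduce the overlap term, which is in line with the abstract's observation that the benefit is specific to fluctuating maskers.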
