Speech reception in noise: How much do we understand?

In order to better understand the effect of hearing impairment on speech perception in everyday listening situations, as well as the still limited benefit of modern hearing instruments in these situations, a thorough understanding of the underlying mechanisms and factors influencing speech reception in noise is highly desirable. This contribution therefore reviews a series of studies by our group to model speech reception in normal-hearing and hearing-impaired listeners in a multidisciplinary approach using “classical” speech intelligibility models, functional perception models, automatic speech recognition (ASR) technology, as well as inputs from psycholinguistics. Classical speech-information-based models like the Articulation Index or the Speech Intelligibility Index (SII) describe the acoustical layer and yield accurate predictions only for average intelligibility scores and for a limited set of acoustical situations. With appropriate extensions they can model more audibility-driven and even time-dependent acoustical situations, such as the effect of hearing impairment in fluctuating noise. However, to describe the sensory layer and suprathreshold processing deficits in humans, the combination of a psychoacoustically motivated preprocessing model with a pattern recognition algorithm adopted from ASR technology appears advantageous. It allows a detailed analysis of phoneme confusions and of the “man-machine gap” of approx. 12 dB in SNR, i.e., the superiority of human world-knowledge-driven (top-down) speech pattern recognition over training-data-driven (bottom-up) machine learning approaches. Finally, the cognitive abilities of human listeners when understanding speech can be assessed by a “fair” comparison between human speech recognition and ASR that employs only a limited set of training data. In summary, both bottom-up and top-down strategies have to be assumed when trying to understand speech reception in noise.
Computer models that assume near-to-perfect “world knowledge”, i.e., anticipation of the speech unit to be recognized, can predict the performance of human listeners in noise surprisingly well and may prove to be a useful tool in hearing aid development.
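As a minimal illustration of the kind of acoustical-layer computation the SII performs, the sketch below implements its core idea: a weighted sum of per-band audibility, where each band's signal-to-noise ratio is mapped linearly from 0 (at -15 dB SNR) to 1 (at +15 dB SNR). The band importance weights here are placeholders; the full ANSI S3.5-1997 procedure additionally specifies standardized band divisions, importance functions, and masking and level-distortion corrections that are omitted in this simplification.

```python
def band_audibility(snr_db: float) -> float:
    """Simplified SII audibility function: 0 below -15 dB SNR,
    1 above +15 dB SNR, linear in between."""
    return min(1.0, max(0.0, (snr_db + 15.0) / 30.0))

def sii(snr_per_band: list[float], band_importance: list[float]) -> float:
    """Importance-weighted sum of band audibilities (0 = nothing
    audible, 1 = full speech information available)."""
    assert abs(sum(band_importance) - 1.0) < 1e-6, "weights must sum to 1"
    return sum(w * band_audibility(s)
               for w, s in zip(band_importance, snr_per_band))

# Hypothetical four-band example with equal importance weights:
weights = [0.25, 0.25, 0.25, 0.25]
print(sii([30, 30, 30, 30], weights))   # fully audible speech -> 1.0
print(sii([-20, -20, -20, -20], weights))  # speech fully masked -> 0.0
```

A psychometric function then maps this index to a predicted intelligibility score; it is this final mapping, fitted to average data, that limits the classical models to average scores and stationary situations, as discussed above.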