Temporal integration as a consequence of multi-source decoding

How do listeners integrate evidence for speech in order to support reliable identification? Much of our everyday listening experience is set against a background of other sources, and the evidence for a speech target frequently manifests itself as scattered time-frequency islands of high signal-to-noise ratio. Individual fragments, such as formant portions, typically contain too little information to identify the speech on their own. However, it appears that relatively few fragments are needed to constrain speech hypotheses to a manageable number (Cooke and Green, forthcoming). Robust speech perception in noise would therefore appear possible if listeners could determine which fragments belong together.

A potential solution exploits auditory scene analysis principles (Bregman, 1990; Cooke and Ellis, 2001), which group evidence on the basis of ‘primitive’ processes such as common onset and harmonicity. While computational instantiations of these techniques have been applied to simultaneous fragments with some success, it is hard to see how temporally disparate elements can be integrated using such constraints. For instance, it has proved difficult to apply interpolation or extrapolation procedures in the sequential grouping of formants and harmonics. Another view (Remez et al., 1994) holds that speech lacks sufficient coherence to enable primitive grouping, and that listeners instead call upon prior knowledge of speech for successful interpretation. Such ‘schema-driven’ grouping is also part of Bregman's conception, but there it works in concert with primitive processes.
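
To make the notion of time-frequency fragments concrete, the sketch below (not drawn from any of the cited studies) marks spectro-temporal cells whose local signal-to-noise ratio exceeds a threshold and groups contiguous cells into candidate fragments. The function name glimpse_fragments, the 3 dB threshold, and the assumption that target and noise spectrograms are separately available are illustrative choices only.

    # Illustrative sketch: treat connected regions of high local SNR as
    # candidate time-frequency fragments. Assumes magnitude spectrograms of
    # the target and the background are available separately, which is only
    # realistic in a simulation or analysis setting.
    import numpy as np
    from scipy.ndimage import label

    def glimpse_fragments(target_spec, noise_spec, snr_threshold_db=3.0):
        """Return a label map of candidate fragments and their count.

        target_spec, noise_spec : 2-D arrays (frequency x time) of magnitudes.
        snr_threshold_db        : local SNR (dB) above which a cell counts as
                                  belonging to a fragment.
        """
        eps = 1e-12
        local_snr_db = 20.0 * np.log10((target_spec + eps) / (noise_spec + eps))
        mask = local_snr_db > snr_threshold_db      # high-SNR cells
        # Group contiguous high-SNR cells into fragments (4-connectivity).
        labels, n_fragments = label(mask)
        return labels, n_fragments

    # Example with random spectra standing in for real signals.
    rng = np.random.default_rng(0)
    target = rng.rayleigh(scale=1.0, size=(64, 200))
    noise = rng.rayleigh(scale=1.0, size=(64, 200))
    frag_labels, n = glimpse_fragments(target, noise)
    print(f"{n} candidate fragments found")

In practice only the noisy mixture is observed, so the local SNR would have to be estimated rather than computed directly; the sketch is intended purely to illustrate what a fragment is, not how listeners or a decoder would obtain one.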