Paper information - Temporal integration as a consequence of multi-source decoding

Temporal integration as a consequence of multi-source decoding

How do listeners integrate evidence for speech in order to support reliable identification? Much of our everyday listening experience is set against a background of other sources, and the evidence for a speech target frequently manifests itself as scattered time-frequency islands of high signal-to-noise ratio. Individual fragments, such as formant portions, typically contain insufficient information. However, it appears that relatively few fragments are needed to constrain speech hypotheses to a manageable number (Cooke and Green, forthcoming). Robust speech perception in noise would therefore appear possible if we could determine which fragments belong together. A potential solution exploits auditory scene analysis principles (Bregman, 1990; Cooke and Ellis, 2001), which seek to group evidence based on ‘primitive’ processes such as common onset and harmonicity. While computational instantiations of these techniques have been applied to simultaneous fragments with some success, it is hard to see how temporally disparate elements can be integrated using such constraints. For instance, it has proved difficult to apply interpolation or extrapolation procedures in sequential grouping of formants and harmonics. Another view (Remez et al., 1994) suggests that speech lacks sufficient coherence to enable primitive grouping, and that listeners instead call upon prior knowledge of speech for successful interpretation. Such ‘schema-driven’ grouping is also part of Bregman's conception, but there it works in concert with primitive processes.

Jon Barker | Martin Cooke | Dan Ellis

[1] David Pearce, et al. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, 2000, INTERSPEECH.

[2] Daniel P. W. Ellis, et al. The auditory organization of speech and other sources in listeners and computational models, 2001, Speech Commun.

[3] Phil D. Green, et al. Robust automatic speech recognition with missing and unreliable acoustic data, 2001, Speech Commun.