A theory and computational model of auditory monaural sound separation
This thesis presents both a conceptual theory of how the auditory system uses monaural acoustic information to separate two simultaneous talkers and a computer model which is based on this theory. The information we believe the auditory system uses to separate sounds is reviewed, and a method to use this information for separating sounds is hypothesized. The computer model hypothesizes how the auditory system might use monaural acoustic information to determine how many sounds are present and the characteristics of each sound source.
The discussion of periodicity information for the separation of sounds focuses on how experimental results about the neural encoding of sounds can be combined with Licklider's theory of pitch perception. It is hypothesized that the auditory system interprets this acoustic information using a two-stage process: local regions in frequency and time are first assigned to an intermediate representation called a "group-object", and these group-objects are then assigned to a "sound-stream". It is also hypothesized that the auditory system uses the local encoding of periodicity information to assign local frequency-time regions to group-objects with similar periodic features. A group-object is then assigned to one of the sound streams present, based on the sound source that the auditory system believes generated this acoustic segment.
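The periodicity-based grouping step can be illustrated with a minimal sketch. Licklider's theory is commonly formulated as a running autocorrelation computed in each frequency channel, so the sketch below estimates a dominant period per channel and collects channels with similar periods into the same group. It is an illustration under stated assumptions, not the model described in the thesis: the function names, the lag range, the `tolerance` and `threshold` values, and the use of plain lists of samples in place of a cochlear filterbank output are all hypothetical.

```python
import math

def normalized_autocorrelation(x, lag):
    """Correlation between x and a copy of itself delayed by `lag` samples."""
    a = x[:len(x) - lag]
    b = x[lag:]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0

def dominant_period(x, min_lag, max_lag):
    """Return (lag, strength) of the largest autocorrelation value
    for lags in [min_lag, max_lag], measured in samples."""
    best_lag, best_r = min_lag, normalized_autocorrelation(x, min_lag)
    for lag in range(min_lag + 1, max_lag + 1):
        r = normalized_autocorrelation(x, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

def group_channels(channels, min_lag, max_lag, tolerance=1, threshold=0.5):
    """Group channel indices whose dominant periods agree within `tolerance`.

    Channels whose peak autocorrelation falls below `threshold` are
    treated as nonperiodic (both values are hypothetical choices).
    Returns (groups, aperiodic), where groups maps a period in samples
    to the list of channel indices assigned to it.
    """
    groups = {}
    aperiodic = []
    for idx, x in enumerate(channels):
        period, strength = dominant_period(x, min_lag, max_lag)
        if strength < threshold:
            aperiodic.append(idx)
            continue
        for key in groups:
            if abs(key - period) <= tolerance:
                groups[key].append(idx)
                break
        else:
            # No existing group has a similar period: start a new one.
            groups[period] = [idx]
    return groups, aperiodic
```

Calling `group_channels(channels, 30, 90)` on a list of equal-length sample sequences returns the period-keyed groups and the indices judged nonperiodic. The lag range should be chosen so that it contains only one multiple of each expected pitch period; otherwise a channel may lock onto a multiple of its true period.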
A computer model which uses acoustic information to separate two simultaneous talkers is presented. The information present in the "fine time structure" of a cochlear model's filterbank output is used as the input to the separation system. The computer model determines how many people are speaking, whether each person's voice can be classified as "periodic" or "nonperiodic", and what the spectral estimate for each talker is. The separation system uses both time and frequency continuity constraints in the modeling of each person's voice. Examples of how the system separates a male and a female voice speaking a string of continuous digits are presented, along with an evaluation of the current implementation.