Interfacing Sound Stream Segregation to Automatic Speech Recognition - Preliminary Results on Listening to Several Sounds Simultaneously

This paper reports preliminary results of experiments on listening to several sounds at once. Two issues are addressed: segregating speech streams from a mixture of sounds, and interfacing speech stream segregation with automatic speech recognition (ASR). Speech stream segregation (SSS) is modeled as a process of extracting harmonic fragments, grouping the extracted fragments, and substituting some sounds for the non-harmonic parts of each group. The system is implemented by extending the harmonic-based stream segregation system reported at AAAI-94 and IJCAI-95. The main problem in interfacing SSS with HMM-based ASR is recovering the recognition performance lost to the spectral distortion of segregated sounds, which is caused mainly by the binaural input, grouping, and residue substitution. Our solution is threefold: re-train the HMM parameters on training data binauralized for four directions, group harmonic fragments according to their directions, and substitute the residue of the harmonic fragments for the non-harmonic parts of each group. Experiments with 500 mixtures of two women's utterances of a word showed that the cumulative word-recognition accuracy up to the 10th candidate is, on average, 75% for each woman's utterance.
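To make the three-stage pipeline concrete, the following is a minimal Python sketch, not the authors' implementation: fragment extraction is reduced to framewise autocorrelation pitch tracking, direction estimation to a cross-correlation interaural time difference (ITD) quantized into four azimuth bins (mirroring the four binauralized training directions), and residue handling to simply collecting non-harmonic frames for later substitution. All function names (`estimate_f0`, `estimate_itd`, `segregate`) and parameter values are hypothetical illustrations.

```python
# Minimal sketch of: (1) harmonic fragment extraction, (2) grouping by
# direction, (3) collecting the non-harmonic residue for substitution.
# Hypothetical simplification of the pipeline described in the abstract.
import numpy as np

FS = 16000          # sampling rate (Hz), assumed
FRAME = 512         # analysis frame length
HOP = 256           # hop size between frames
MAX_LAG = 32        # maximum ITD search range (samples), assumed

def estimate_f0(frame, fmin=80.0, fmax=400.0):
    """Crude autocorrelation pitch estimate; returns 0.0 for unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(FS / fmax), int(FS / fmin)
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi]))
    # accept the peak only if the frame is strongly periodic
    return FS / lag if ac[lag] > 0.3 * ac[0] else 0.0

def estimate_itd(left, right):
    """Interaural time difference in samples via cross-correlation."""
    xc = np.correlate(left, right, mode="full")
    mid = len(left) - 1
    window = xc[mid - MAX_LAG: mid + MAX_LAG + 1]
    return int(np.argmax(window)) - MAX_LAG

def segregate(left, right, n_directions=4):
    """Extract harmonic fragments, group them by direction bin, and keep
    the non-harmonic residue frames for substitution into each group."""
    fragments = []          # (frame_index, f0, direction_bin)
    residue_frames = []     # frames with no harmonic structure
    for i in range(0, len(left) - FRAME, HOP):
        l, r = left[i:i + FRAME], right[i:i + FRAME]
        f0 = estimate_f0(0.5 * (l + r))
        if f0 > 0.0:
            # quantize the ITD (range [-MAX_LAG, MAX_LAG]) into azimuth bins
            itd = estimate_itd(l, r)
            d = min(n_directions - 1,
                    int((itd + MAX_LAG) * n_directions / (2 * MAX_LAG + 1)))
            fragments.append((i // HOP, f0, d))
        else:
            residue_frames.append(i // HOP)
    # grouping step: fragments that share a direction bin form one stream;
    # residue_frames would be substituted for each stream's gaps
    groups = {d: [f for f in fragments if f[2] == d]
              for d in range(n_directions)}
    return groups, residue_frames

if __name__ == "__main__":
    # toy demo: a 200 Hz tone with a simulated interaural delay
    t = np.arange(FS) / FS
    left = np.sin(2 * np.pi * 200 * t)
    right = np.roll(left, 5)
    groups, residue = segregate(left, right)
    print({d: len(fs) for d, fs in groups.items()}, len(residue))
```

In the system described above, each direction group would then be passed to an HMM recognizer whose parameters were re-trained on data binauralized for that direction; this sketch stops at the grouping stage and omits the actual residue substitution.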

[1]  P. Green, et al. Computational auditory scene analysis: listening to several things at once, 1993, Endeavour.

[2]  I. Nelken. Demonstrations of Auditory Scene Analysis: The Perceptual Organization of Sound, by Albert S. Bregman and Pierre A. Ahad (MIT Press, 1996), 1997, Trends in Neurosciences.

[3]  R. M. Warren. Perceptual Restoration of Missing Speech Sounds, 1970, Science.

[4]  Tomohiro Nakatani, et al. Localization by harmonic structure and its application to harmonic sound stream segregation, 1996, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[5]  Tomohiro Nakatani, et al. Cocktail-Party Effect with Computational Auditory Scene Analysis: Preliminary Report, 1995.

[6]  Richard J. Mammone, et al. Introduction to the special issue on neural networks for speech processing, 1994, IEEE Trans. Speech Audio Process.

[7]  Guy J. Brown. Computational auditory scene analysis: a representational approach, 1993.

[8]  M. Bodden. Modeling human sound-source localization and the cocktail-party-effect, 1993.

[9]  S. Handel, et al. Listening: An Introduction to the Perception of Auditory Events, 1993.

[10]  Tomohiro Nakatani, et al. Residue-Driven Architecture for Computational Auditory Scene Analysis, 1995, IJCAI.

[11]  Ramdas Kumaresan, et al. Voiced-speech analysis based on the residual interfering signal canceler (RISC) algorithm, 1994, Proceedings of ICASSP '94, IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  R. W. Stadler, et al. On the potential of fixed arrays for hearing aids, 1993.

[13]  Sadaoki Furui, et al. A maximum likelihood procedure for a universal adaptation method based on HMM composition, 1995, International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[14]  G. Kramer. Auditory Scene Analysis: The Perceptual Organization of Sound by Albert Bregman (review), 2016.

[15]  Phil D. Green, et al. Auditory scene analysis and hidden Markov model recognition of speech in noise, 1995, International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[16]  S. Handel. Listening: An Introduction to the Perception of Auditory Events, 1989.

[17]  Tomohiro Nakatani, et al. Auditory Stream Segregation in Auditory Scene Analysis with a Multi-Agent System, 1994, AAAI.

[18]  Victor R. Lesser, et al. IPUS: An Architecture for Integrated Signal Processing and Signal Interpretation in Complex Environments, 1993, AAAI.

[19]  W. M. Rabinowitz, et al. On the potential of fixed arrays for hearing aids, 1993, The Journal of the Acoustical Society of America.

[20]  J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization, 1983.

[21]  Tomohiro Nakatani, et al. A computational model of sound stream segregation with multi-agent paradigm, 1995, International Conference on Acoustics, Speech, and Signal Processing (ICASSP).