Comparison of neural architectures for sensor fusion

For both automatic speech recognition systems and human listeners, it has been shown that combining acoustic and visual information can enhance speech recognition performance. It remains an open question, however, at which stage of processing the two information channels should be combined. We investigate this problem systematically by means of a neural speech recognition system applied to monosyllabic words. Different fusion architectures of multilayer perceptrons are compared on both noiseless and noisy acoustic data. Furthermore, different modularized neural architectures are examined for the acoustic channel alone. The results corroborate the idea of processing the two channels separately until the final stage of classification.
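
To make the contrast between the fusion strategies concrete, the sketch below places an early-fusion MLP (acoustic and visual feature vectors concatenated at the input) next to a late-fusion design (separate per-channel networks whose outputs are merged only at the final classification layer). This is a minimal illustration in PyTorch, not the paper's implementation; the feature dimensions, hidden-layer sizes, and word-class count are placeholder assumptions.

```python
# Minimal sketch of early vs. late sensor fusion with MLPs (PyTorch).
# All dimensions below are illustrative placeholders, not the paper's values.
import torch
import torch.nn as nn

N_ACOUSTIC = 128   # assumed acoustic feature dimension
N_VISUAL = 32      # assumed visual (lip) feature dimension
N_CLASSES = 10     # assumed number of monosyllabic word classes


class EarlyFusionMLP(nn.Module):
    """Combine the channels at the input: one MLP sees both modalities."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_ACOUSTIC + N_VISUAL, 64),
            nn.ReLU(),
            nn.Linear(64, N_CLASSES),
        )

    def forward(self, acoustic, visual):
        return self.net(torch.cat([acoustic, visual], dim=-1))


class LateFusionMLP(nn.Module):
    """Process each channel separately; merge only at the final classifier."""

    def __init__(self):
        super().__init__()
        self.acoustic_net = nn.Sequential(nn.Linear(N_ACOUSTIC, 64), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(N_VISUAL, 16), nn.ReLU())
        self.classifier = nn.Linear(64 + 16, N_CLASSES)

    def forward(self, acoustic, visual):
        merged = torch.cat(
            [self.acoustic_net(acoustic), self.visual_net(visual)], dim=-1
        )
        return self.classifier(merged)


if __name__ == "__main__":
    a = torch.randn(4, N_ACOUSTIC)  # batch of acoustic feature vectors
    v = torch.randn(4, N_VISUAL)    # batch of visual feature vectors
    print(EarlyFusionMLP()(a, v).shape)  # torch.Size([4, 10])
    print(LateFusionMLP()(a, v).shape)   # torch.Size([4, 10])
```

In a setup like this, the late-fusion design keeps the visual path unaffected by acoustic noise until the merge, which is consistent with the paper's conclusion that the two channels are best processed separately until the final classification stage.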
