Distinctive feature fusion for improved audio-visual phoneme recognition

Auditory and visual signals provide complementary information, but few applications successfully combine the two sources. We consider a distinctive-feature approach to Audio-Visual Automatic Speech Recognition (AV-ASR) in which features appropriate to each modality are employed, and demonstrate that, in the absence of knowledge about the noise, the modality-specific approach is best. However, even information from the non-preferred modality can be usefully employed if the environmental context (e.g. SNR) is accountedted for by adaptively weighting each modality. Future research is focusing on deriving these distinctive features automatically from data rather than using those proposed by linguists.
