Unsupervised Stream-Weights Computation in Classification and Recognition Tasks

In this paper, we provide theoretical results on the problem of optimal stream weight selection for the two-stream classification problem. It is shown that, in the presence of estimation or modeling errors, using stream weights can decrease the total classification error. Specifically, we show that stream weights should be selected to be proportional to the reliability and informativeness of each feature stream. Next, we turn our attention to the problem of unsupervised stream weight computation in real tasks. Based on the theoretical results, we propose to use models and "anti-models" (class-specific background models) to estimate stream weights. A nonlinear function of the ratio of the inter- to intra-class distance is proposed for stream weight estimation. The resulting unsupervised stream weight estimation algorithm is evaluated on both artificial data and on the problem of audiovisual speech classification. Finally, the proposed algorithm is extended to the problem of audiovisual speech recognition. It is shown that the proposed algorithms achieve results comparable to the supervised minimum-error training approach for classification tasks under most testing conditions.
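As a rough illustration of what a stream weight does in a two-stream classifier, the sketch below combines per-stream log-likelihoods with exponent weights that sum to one, which is the usual multi-stream formulation. Everything in it is an assumption made for illustration: the one-dimensional Gaussian class models, the names `gaussian_log_pdf` and `classify_two_stream`, and the hand-picked weights. It does not implement the anti-model or inter- to intra-class distance estimator proposed in the paper, whose exact form is not given in this abstract.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log-density of a 1-D Gaussian (illustrative stream model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def classify_two_stream(x1, x2, models, w1):
    """Return the class maximizing w1*log p(x1|c) + (1 - w1)*log p(x2|c).

    models maps a class label to ((mean1, var1), (mean2, var2)),
    one hypothetical Gaussian per stream.
    """
    w2 = 1.0 - w1  # assumed constraint: the two weights sum to one
    best_label, best_score = None, -math.inf
    for label, (stream1, stream2) in models.items():
        score = (w1 * gaussian_log_pdf(x1, *stream1)
                 + w2 * gaussian_log_pdf(x2, *stream2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-class problem in which the streams disagree:
# the stream-1 observation sits near class A, the stream-2 observation near class B.
models = {
    "A": ((0.0, 1.0), (0.0, 1.0)),
    "B": ((2.0, 1.0), (2.0, 1.0)),
}

print(classify_two_stream(0.2, 1.9, models, w1=0.9))  # trusts stream 1
print(classify_two_stream(0.2, 1.9, models, w1=0.1))  # trusts stream 2
```

When the two streams disagree, the decision follows whichever stream carries the larger weight, which is why a weight tied to stream reliability can reduce the error of the combined classifier.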
