Multimodal speech processing using asynchronous Hidden Markov Models

Abstract: This paper argues that, for multimodal tasks in which several data streams describe the same sequence of events, allowing the streams to be desynchronized can increase their joint likelihood. We therefore present a novel Hidden Markov Model architecture that models the joint probability of pairs of asynchronous sequences describing the same sequence of events. We give an Expectation–Maximization algorithm for training the model and a Viterbi decoding algorithm that recovers both the optimal state sequence and the alignment between the two sequences. The model was tested on two audio–visual speech processing tasks, speech recognition and text-dependent speaker verification, both using the M2VTS database. In both cases, performance remained robust under various noise conditions.
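To make the decoding idea concrete, here is a minimal Viterbi sketch for a model of this kind. It is an illustration under stated assumptions, not the paper's implementation: it assumes that at every time step t the current state emits x_t from the longer stream and, with a per-state probability eps, also emits the next element y_s of the shorter stream. The function name `async_viterbi`, the array layouts, and the choice of a time-independent `eps` are all hypothetical simplifications.

```python
import numpy as np

def async_viterbi(log_pi, log_A, log_bx, log_bxy, log_eps):
    """Joint Viterbi decoding over two streams x (length T) and y (length S <= T).

    All inputs are assumptions made for this sketch:
      log_pi  (N,)      initial state log-probabilities
      log_A   (N, N)    log_A[j, i] = log P(q_t = i | q_{t-1} = j)
      log_bx  (N, T)    log p(x_t | state) when only x is emitted
      log_bxy (N, T, S) log p(x_t, y_s | state) when both are emitted
      log_eps (N,)      log-probability that a state also emits the next y
    Returns (best log-score, state path, alignment), where alignment[t] is the
    index of the y element emitted at time t, or None.
    """
    N, T = log_bx.shape
    S = log_bxy.shape[2]
    log_not_eps = np.log1p(-np.exp(log_eps))
    cols = np.arange(N)

    # delta[t, s, i]: best log-score after consuming x_0..x_t and the first
    # s elements of y, ending in state i.
    delta = np.full((T, S + 1, N), -np.inf)
    prev_state = np.zeros((T, S + 1, N), dtype=int)
    emitted = np.zeros((T, S + 1, N), dtype=bool)

    delta[0, 0] = log_pi + log_not_eps + log_bx[:, 0]
    delta[0, 1] = log_pi + log_eps + log_bxy[:, 0, 0]
    emitted[0, 1] = True

    for t in range(1, T):
        for s in range(S + 1):
            # Case 1: only x_t is emitted, so s is unchanged.
            cand = delta[t - 1, s][:, None] + log_A
            j_stay = cand.argmax(axis=0)
            stay = cand[j_stay, cols] + log_not_eps + log_bx[:, t]
            delta[t, s] = stay
            prev_state[t, s] = j_stay
            if s >= 1:
                # Case 2: x_t and y_{s-1} are emitted together.
                cand = delta[t - 1, s - 1][:, None] + log_A
                j_emit = cand.argmax(axis=0)
                emit = cand[j_emit, cols] + log_eps + log_bxy[:, t, s - 1]
                use = emit > stay
                delta[t, s] = np.where(use, emit, stay)
                prev_state[t, s] = np.where(use, j_emit, j_stay)
                emitted[t, s] = use

    # Backtrack from the best final state with all of y consumed.
    i = int(delta[T - 1, S].argmax())
    score = float(delta[T - 1, S, i])
    s = S
    states, alignment = [], []
    for t in range(T - 1, -1, -1):
        states.append(i)
        did_emit = bool(emitted[t, s, i])
        alignment.append(s - 1 if did_emit else None)
        j = int(prev_state[t, s, i])
        if did_emit:
            s -= 1
        i = j
    states.reverse()
    alignment.reverse()
    return score, states, alignment
```

The dynamic program runs over triples (t, s, state), so its cost grows as T × S × N² rather than the T × N² of a standard HMM. The returned alignment marks which time steps of the longer stream are paired with each element of the shorter one.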
