A fused hidden Markov model with application to bimodal speech processing

This paper presents a novel fused hidden Markov model (fused HMM) for integrating tightly coupled time series, such as the audio and visual features of speech. In this model, each time series is first modeled separately by a conventional HMM. The two resulting HMMs are then fused using a probabilistic fusion model that is optimal according to the maximum entropy principle and a maximum mutual information criterion. Simulations and bimodal speaker verification experiments show that the proposed model can significantly reduce recognition errors in both noiseless and noisy environments.
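The two-stage idea described above (train one HMM per stream, then couple them) can be illustrated with a minimal sketch. The abstract does not give the fusion formula, so the factorization used here is an assumption: the fused score is taken as log p(O1) + log p(O2) plus a frame-wise coupling term log[p(u, o) / (p(u) p(o))], where u is the Viterbi-decoded state of the first HMM and o is the simultaneous observation of the second stream. Discrete observations, the `joint` coupling table, and all function names and parameter values are hypothetical choices made for illustration only.

```python
import math

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of a discrete sequence under an HMM (scaled forward pass)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)              # p(o_t | o_0 .. o_{t-1})
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

def viterbi(pi, A, B, obs):
    """Most likely hidden state path; assumes strictly positive probabilities."""
    n = len(pi)
    delta = [math.log(pi[i] * B[i][obs[0]]) for i in range(n)]
    back = []
    for o in obs[1:]:
        prev = [max(range(n), key=lambda i: delta[i] + math.log(A[i][j]))
                for j in range(n)]
        delta = [delta[prev[j]] + math.log(A[prev[j]][j] * B[j][o])
                 for j in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for prev in reversed(back):         # follow back-pointers to the start
        state = prev[state]
        path.append(state)
    return path[::-1]

def fused_loglik(hmm1, hmm2, joint, obs1, obs2):
    """Assumed fused score: log p(O1) + log p(O2) + frame-wise coupling term.

    joint[u][o] is a hypothetical table for p(state of HMM 1 = u, symbol of
    stream 2 = o); its marginals supply p(u) and p(o).
    """
    states = viterbi(*hmm1, obs1)
    p_state = [sum(row) for row in joint]
    p_obs = [sum(col) for col in zip(*joint)]
    coupling = sum(math.log(joint[u][o] / (p_state[u] * p_obs[o]))
                   for u, o in zip(states, obs2))
    return forward_loglik(*hmm1, obs1) + forward_loglik(*hmm2, obs2) + coupling

# Toy parameters (pi, A, B) for an "audio" and a "visual" stream, made up here.
hmm_a = ([0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]])
hmm_v = ([0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]], [[0.8, 0.2], [0.3, 0.7]])
obs_a = [0, 0, 1, 1]
obs_v = [0, 0, 1, 1]

joint_indep = [[0.25, 0.25], [0.25, 0.25]]   # streams unrelated: coupling term is 0
joint_corr = [[0.4, 0.1], [0.1, 0.4]]        # streams tend to agree

score_indep = fused_loglik(hmm_a, hmm_v, joint_indep, obs_a, obs_v)
score_corr = fused_loglik(hmm_a, hmm_v, joint_corr, obs_a, obs_v)
print(score_indep, score_corr)
```

With the independent table the coupling term vanishes and the score reduces to the sum of the two single-stream log-likelihoods. With the correlated table, streams that agree frame by frame are rewarded, which is the intuition behind fusing tightly coupled modalities rather than scoring them separately.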
