Method of Speech Recognition and Speaker Identification Using Audio-Visual Polish Speech and Hidden Markov Models

Mainstream automatic speech recognition has focused almost exclusively on the acoustic signal, and the performance of such systems degrades considerably in the real world in the presence of noise. Novel approaches are needed that exploit sources of information orthogonal to the acoustic input; such approaches not only improve performance considerably under severely degraded conditions but are also independent of the type of noise and reverberation. Visual speech is one such source, unperturbed by the acoustic environment and noise. This paper presents our approach to lip tracking and to the fusion of audio and video signals for an audio-visual speech and speaker recognition system. We describe video analysis of visual speech for extracting visual features of a talking person from color video sequences, and we develop a method for automatic localization of the face, the eyes, the mouth region, and the corners and contour of the mouth. We propose one synchronous and two asynchronous methods for fusing the audio and video signals. Finally, the paper reports lip-tracking results under various conditions (lighting, facial hair) and speech and speaker recognition results in noisy environments.
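As a rough illustration of the kind of visual front end the abstract describes, the sketch below localizes a face with OpenCV's stock Haar cascade and takes the lower third of the face box as a candidate mouth region. This is a minimal sketch under stated assumptions, not the authors' method: the cascade detector and the fixed lower-third heuristic are illustrative substitutes for the paper's own localization algorithm.

```python
# Hypothetical sketch of face and mouth-region localization with OpenCV.
# The Haar-cascade detector and the lower-third heuristic are illustrative
# assumptions; the paper's own localization algorithm is not reproduced here.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def locate_mouth_region(frame_bgr):
    """Return (x, y, w, h) of a candidate mouth region, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
    # Assume the mouth occupies roughly the lower third of the face box,
    # centered horizontally; corner and contour extraction would refine this.
    return (x + w // 4, y + 2 * h // 3, w // 2, h // 3)
```

A real system would refine this coarse region with the color- and contour-based analysis the paper describes before extracting lip features.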
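The synchronous fusion method mentioned above corresponds, in general terms, to feature-level fusion: audio and visual feature vectors computed at a common frame rate are concatenated and modeled by a single HMM. The following sketch uses hmmlearn to show the idea; the feature dimensions, frame-rate alignment, state count, and Gaussian HMM topology are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of synchronous (feature-level) audio-visual fusion:
# per-frame audio and visual feature vectors are concatenated and a single
# Gaussian HMM is trained on the joint stream. Dimensions, state count,
# and the use of hmmlearn are illustrative assumptions.
import numpy as np
from hmmlearn import hmm

def train_av_hmm(audio_feats, visual_feats, n_states=5):
    """audio_feats: (T, Da), e.g. MFCCs; visual_feats: (T, Dv), e.g. lip
    geometry; both streams resampled to the same T frames beforehand."""
    joint = np.hstack([audio_feats, visual_feats])      # (T, Da + Dv)
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=20)
    model.fit(joint)
    return model

def recognize(models, audio_feats, visual_feats):
    """Score the joint sequence under each word model; return the best word.
    models: dict mapping word -> trained GaussianHMM."""
    joint = np.hstack([audio_feats, visual_feats])
    return max(models, key=lambda w: models[w].score(joint))
```

An asynchronous (decision-level) variant would instead train separate audio and video HMMs and combine their log-likelihoods, for example as a weighted sum log p = w * log p_audio + (1 - w) * log p_video, which tolerates timing offsets between the two streams.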
