Visual Lip Activity Detection and Speaker Detection Using Mouth Region Intensities

In this letter, we introduce a novel approach to lip activity detection and speaker detection that uses solely visual information. The main idea is to apply signal detection algorithms to a simple, easily extracted feature of the mouth region: the number of pixels with low intensities. We argue that the mouth region of a speaking person exhibits an increased mean and standard deviation of this pixel count, and that both can serve as visual cues for detecting visual speech. We then derive a statistical algorithm that exploits this fact to efficiently characterize visual speech and silence in video sequences. Furthermore, we employ the lip activity detection method to determine the active speaker(s) in a multi-person environment.
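The feature described above can be illustrated with a minimal sketch. It is not the statistical detector derived in the letter: it only counts dark pixels in an already cropped grayscale mouth region and flags frames whose recent mean or standard deviation of that count is elevated. The function names, the intensity cutoff, the window length, and both decision thresholds are hypothetical values chosen for illustration, and combining the two cues with a logical "or" is likewise an assumption.

```python
import numpy as np


def low_intensity_counts(mouth_frames, intensity_threshold=60):
    """Count, for each frame, the mouth-region pixels darker than a cutoff.

    mouth_frames: array of shape (num_frames, height, width), grayscale.
    intensity_threshold: hypothetical cutoff, not taken from the letter.
    """
    frames = np.asarray(mouth_frames)
    dark = frames < intensity_threshold
    return dark.reshape(frames.shape[0], -1).sum(axis=1)


def detect_lip_activity(counts, window=5, mean_threshold=20.0, std_threshold=5.0):
    """Flag a frame as active when the trailing-window mean or standard
    deviation of the dark-pixel count exceeds its (hypothetical) threshold."""
    counts = np.asarray(counts, dtype=float)
    active = np.zeros(counts.shape[0], dtype=bool)
    for t in range(counts.shape[0]):
        segment = counts[max(0, t - window + 1): t + 1]
        active[t] = segment.mean() > mean_threshold or segment.std() > std_threshold
    return active
```

For the multi-person case, one plausible use is to run the same two functions on each tracked mouth region separately and treat every person whose frames are flagged as an active speaker; how the letter itself combines the per-person decisions is not stated in the abstract.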
