Audiovisual Voice Activity Detection Based on Microphone Arrays and Color Information

Audiovisual voice activity detection (VAD) is a necessary stage in several applications, such as advanced teleconferencing, speech recognition, and human-computer interaction. Lip motion and audio analysis provide a large amount of information that can be integrated to produce more robust audiovisual VAD schemes, as we discuss in this paper. Lip motion is very useful for detecting the active speaker, and we introduce a new approach for lip extraction and visual VAD. First, the algorithm performs skin segmentation to reduce the search area for lip extraction, and the most likely lip and non-lip regions are detected within the delimited area using a Bayesian approach. Lip motion is then analyzed using Hidden Markov Models (HMMs) that estimate the likelihood of active speech within a temporal window. Audio information is captured by an array of microphones, and the sound-based VAD searches for spatio-temporally coherent sound sources using another set of HMMs. To increase the robustness of the proposed system, a late fusion approach combines the results of the two modalities (audio and video). Our experimental results indicate that the proposed audiovisual approach outperforms existing VAD algorithms.
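
As an illustration of the last two stages, the following is a minimal sketch of per-modality HMM filtering followed by late fusion. It is not the implementation used in this paper: the two-state (silence/speech) topology, the transition and prior values, the per-frame likelihoods, and the weighted-average fusion rule with its `w_audio` parameter are all assumptions chosen for brevity, and the function names are hypothetical.

```python
def forward_posteriors(likelihoods, trans, prior):
    """Filtered state posteriors P(state_t | obs_1..t) of a discrete-state HMM.

    likelihoods: per-frame lists [p(obs | silence), p(obs | speech)]
    trans:       trans[i][j] = P(state_t = j | state_{t-1} = i)
    prior:       initial state distribution
    """
    n = len(prior)
    belief = list(prior)
    posteriors = []
    for t, lik in enumerate(likelihoods):
        if t > 0:
            # Prediction step: propagate the belief through the transitions.
            belief = [sum(belief[i] * trans[i][j] for i in range(n))
                      for j in range(n)]
        # Update step: weight by the frame likelihood and renormalize.
        unnorm = [b * l for b, l in zip(belief, lik)]
        z = sum(unnorm)
        if z == 0.0:
            raise ValueError("observation has zero likelihood in every state")
        belief = [u / z for u in unnorm]
        posteriors.append(belief)
    return posteriors


def late_fusion(p_audio, p_video, w_audio=0.5):
    """Assumed fusion rule: weighted average of per-frame speech posteriors."""
    return [w_audio * a + (1.0 - w_audio) * v
            for a, v in zip(p_audio, p_video)]


# Illustrative (made-up) values: state 0 = silence, state 1 = speech.
TRANS = [[0.9, 0.1], [0.1, 0.9]]
PRIOR = [0.5, 0.5]
audio_lik = [[0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
video_lik = [[0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]

p_audio = [p[1] for p in forward_posteriors(audio_lik, TRANS, PRIOR)]
p_video = [p[1] for p in forward_posteriors(video_lik, TRANS, PRIOR)]
fused = late_fusion(p_audio, p_video)
speech_active = [p > 0.5 for p in fused]
```

Each modality is filtered independently and only the resulting speech posteriors are combined, which is what distinguishes late fusion from concatenating audio and video features before classification.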
