Lip event detection using oriented histograms of regional optical flow and low rank affinity pursuit

An efficient descriptor, oriented histograms of regional optical flow (OH-ROF), is presented to discriminatively code the visual appearance of each lip motion frame. Each lip motion clip is represented by a sequence of OH-ROF vectors as its signature. We introduce a sequence stabilization scheme to reduce the impact of irrelevant motions. We present an efficient approach to detecting the visual silence event via small flow magnitude. We propose a low rank affinity pursuit method to determine the lip-dynamic states of mouth opening and closing.

Lip event detection is of crucial importance to a better perceptual understanding of visual speech between humans and computers. In this paper, we present an efficient lip event detection approach using oriented histograms of regional optical flow (OH-ROF) and low rank affinity pursuit. First, we align the extracted lip region sequences to reduce the impact of irrelevant motion caused by moving cameras. Then, an optical flow field is calculated from these sequentially stabilized images, and an efficient descriptor, namely OH-ROF, is presented to discriminatively code the visual appearance of each motion frame, whereby each lip motion clip can be represented by a sequence of OH-ROF vectors as its signature. Subsequently, we detect the visual silence event based on small flow magnitude, and further propose a low rank affinity pursuit method to determine the visual speech event, which incorporates the lip-dynamic states of mouth opening and closing. As a result, various kinds of lip motion events can be appropriately estimated. The proposed approach requires neither a training set of labeled videos nor learned lip motion priors for each visual event in an unconstrained video. Experiments show promising results in comparison with state-of-the-art counterparts.
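The abstract does not give the exact construction of the OH-ROF descriptor or the silence test, so the following is only a minimal sketch of the general idea under stated assumptions: a dense flow field is split into a regular grid of regions, each region gets a magnitude-weighted histogram of flow orientations, and a frame is flagged as visually silent when its mean flow magnitude is small. The function names `oh_rof` and `is_silent`, the 2x2 grid, the 8 orientation bins, and the threshold value are all hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np

def oh_rof(flow, grid=(2, 2), bins=8):
    """Illustrative regional orientation histogram of an optical flow field.

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    Returns a vector of length grid[0] * grid[1] * bins.
    Grid size and bin count are assumptions, not the paper's settings.
    """
    h, w, _ = flow.shape
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    # Orientation mapped to [0, 2*pi), then quantized into `bins` sectors.
    ang = np.arctan2(dy, dx) % (2 * np.pi)
    bin_idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)

    rows = np.array_split(np.arange(h), grid[0])
    cols = np.array_split(np.arange(w), grid[1])
    parts = []
    for r in rows:
        for c in cols:
            b = bin_idx[np.ix_(r, c)].ravel()
            m = mag[np.ix_(r, c)].ravel()
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist = np.bincount(b, weights=m, minlength=bins)
            total = hist.sum()
            # Normalize per region; leave an all-zero histogram untouched.
            parts.append(hist / total if total > 0 else hist)
    return np.concatenate(parts)

def is_silent(flow, threshold=0.1):
    """Flag a frame as visually silent when mean flow magnitude is small.

    The threshold is a hypothetical value chosen for illustration.
    """
    mean_mag = float(np.hypot(flow[..., 0], flow[..., 1]).mean())
    return mean_mag < threshold
```

Applied to every frame of a stabilized clip, `oh_rof` would yield a sequence of vectors playing the role of the clip signature described above; the low rank affinity pursuit step that separates mouth opening from closing is not sketched here because the abstract gives no detail on it.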
