Latent Dynamic Space-Time Volumes for Predicting Human Facial Behavior in Videos

Author(s): Sikka, Karan | Advisor(s): Nguyen, Truong | Abstract: Enabling machines to understand non-verbal facial behavior from visual data is crucial for building smart interactive systems. This thesis focuses on human behavior analysis in videos. Previous state-of-the-art methods generally employed global temporal pooling approaches that (i) assume the presence of a single uniform event spanning the sequence, and (ii) discard temporal ordering by squashing all information along the temporal dimension. In this dissertation we focus on two specific modeling challenges unaddressed by previous approaches. The first is training with weak labels, which provide only video-level annotations and are much cheaper to obtain than fine (frame-level) annotations. The second concerns modeling temporal dynamics during prediction, as facial expressions are dynamic actions with sub-events. We tackle these issues with methods based on Weakly Supervised Latent Variable Models (WSLVM) and evaluate them on real-world spontaneous expressions. We begin by combining the Multiple Instance Learning (MIL) framework with a Multiple Segment representation (MS-MIL). MS-MIL can simultaneously classify and localize target behavior in videos despite training with weak annotations. However, this method cannot explicitly model multiple latent concepts or global temporal order. We address temporal order in the next chapter by learning an exemplar Hidden Markov Model for each sequence. This algorithm models dependencies between segments, but its modeling capacity is limited because it relies on generative modeling. Chapter 4 extends MIL to learn multiple discriminative concepts in a novel formulation for joint clustering and classification. This algorithm shows consistent performance improvement but does not capture temporal structure.
Finally, we present a unified learning framework that combines the strengths of the previously proposed algorithms in that it (i) handles weakly labeled data, (ii) learns multiple discriminative concepts, and (iii) models the temporal ordering of the concepts. This method is a novel WSLVM that models a video as a sequence of automatically mined, multiple discriminative sub-events with a loose temporal structure. We show both qualitative and quantitative results highlighting improvements over state-of-the-art algorithms obtained by jointly addressing weak labels and temporal dynamics.
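To make the contrast between global temporal pooling and a multiple-instance view concrete, here is a minimal sketch. It is an illustration only and not the thesis's MS-MIL implementation: the per-segment scores are made-up numbers, and the function names `mean_pool_score` and `mil_max_score` are hypothetical. It assumes each video has already been split into segments and that some classifier has given each segment a real-valued score.

```python
from typing import List, Tuple


def mean_pool_score(segment_scores: List[float]) -> float:
    """Global temporal pooling: average every segment score.

    The result ignores where in the video the evidence occurred.
    """
    return sum(segment_scores) / len(segment_scores)


def mil_max_score(segment_scores: List[float]) -> Tuple[float, int]:
    """MIL-style max pooling over segments (instances) of a video (bag).

    Returns the video-level score and the index of the segment that
    produced it, which serves as a coarse localization of the behavior.
    """
    best_index = max(range(len(segment_scores)), key=lambda i: segment_scores[i])
    return segment_scores[best_index], best_index


# Hypothetical scores for five consecutive segments of one video:
# only the third segment contains the target expression.
scores = [-1.0, -0.5, 2.0, -1.5, -1.0]

pooled = mean_pool_score(scores)
video_score, where = mil_max_score(scores)

print("mean-pooled score:", pooled)
print("MIL score:", video_score, "at segment", where)
```

With these illustrative numbers the averaged score is negative, so a short expression is washed out by the neutral segments, while the max-based score stays positive and points to the segment responsible. Training such a model from video-level labels alone, and adding temporal order between segments, is what the methods summarized above address.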
