Classification and weakly supervised pain localization using multiple segment representation

Automatic pain recognition from videos is a vital clinical application and, owing to its spontaneous nature, poses interesting challenges to automatic facial expression recognition (AFER) research. Previous pain vs. no-pain systems have highlighted two major challenges: (1) ground truth is provided for the sequence, but the presence or absence of the target expression in a given frame is unknown, and (2) the time point and the duration of the pain expression event(s) in each video are unknown. To address these issues we propose a novel framework (referred to as MS-MIL) in which each sequence is represented as a bag containing multiple segments, and multiple instance learning (MIL) is employed to handle this weakly labeled data in the form of sequence-level ground truth. These segments are generated either by clustering a sequence multiple times or by running a multi-scale temporal scanning window, and are represented using a state-of-the-art Bag of Words (BoW) representation. This work extends the idea of detecting facial expressions through 'concept frames' to 'concept segments', and argues through extensive experiments that algorithms such as MIL are needed to reap the benefits of such a representation. The key advantages of our approach are: (1) joint detection and localization of painful frames using only sequence-level ground truth, (2) incorporation of temporal dynamics by representing the data not as individual frames but as segments, and (3) extraction of multiple segments, which is well suited to signals with uncertain temporal location and duration in the video. Extensive experiments on the UNBC-McMaster Shoulder Pain dataset highlight the effectiveness of the approach, achieving competitive results on both tasks of pain classification and localization in videos. We also empirically evaluate the contributions of the different components of MS-MIL.
The paper also visualizes the discriminative facial patches that our algorithm discovers as important for pain detection, and relates them to Action Units that have been associated with pain expression. We conclude the paper by demonstrating that MS-MIL yields a significant improvement on another spontaneous facial expression dataset, the FEEDTUM dataset.
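To make the pipeline described above concrete, the following is a minimal sketch (not the authors' implementation) of the two core ideas: generating overlapping multi-scale temporal segments from a sequence, and MIL-style bag scoring, where each segment is scored individually and the bag takes the maximum, so a single painful segment suffices to flag the whole sequence while also localizing the event. The mean-pooled segment feature stands in for the BoW segment descriptor, and `w`, the scales, and the stride are illustrative assumptions.

```python
import numpy as np

def multiscale_segments(n_frames, scales=(8, 16, 32), stride=4):
    """Enumerate (start, end) windows at several temporal scales.

    Sketch of a multi-scale temporal scanning window: each video of
    n_frames frames yields overlapping candidate segments whose
    lengths are drawn from `scales` (illustrative values).
    """
    segments = []
    for length in scales:
        for start in range(0, max(n_frames - length + 1, 1), stride):
            segments.append((start, min(start + length, n_frames)))
    return segments

def bag_score(frame_features, segments, w, b=0.0):
    """MIL-style bag prediction: score every segment with a linear
    model on its mean-pooled feature (a stand-in for the BoW segment
    descriptor) and take the max over segments, so one positive
    instance labels the whole sequence and localizes the event."""
    scores = [float(w @ frame_features[s:e].mean(axis=0) + b)
              for s, e in segments]
    best = int(np.argmax(scores))
    return scores[best], segments[best]  # bag score, localized segment
```

As a usage example, a synthetic 40-frame sequence with an elevated-response burst in frames 10-18 would yield a high bag score, with the returned segment overlapping that burst, which mirrors the joint classification-and-localization behavior described above.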
