The Pros and Cons: Rank-Aware Temporal Attention for Skill Determination in Long Videos

We present a new model to determine relative skill from long videos via learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. In long videos, however, parts of the video are irrelevant to assessing skill, and the skill exhibited may vary throughout the video. We therefore propose a method that assesses the relative overall level of skill in a long video by attending to its skill-relevant parts. Our approach trains temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task-relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill. We evaluate our approach on the EPIC-Skills dataset and additionally annotate a larger dataset from YouTube videos for skill determination, covering five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4% in pairwise accuracy, and by as much as 12% on individual tasks. We also demonstrate our model’s ability to attend to rank-aware parts of the video.
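
To make the described architecture concrete, below is a minimal PyTorch sketch of the core idea: two temporal attention branches (pros and cons) that each pool per-segment features into a video-level representation, feeding a head that emits a scalar skill score trained with a pairwise ranking objective. All class and function names here are hypothetical, the feature dimension is a placeholder, and the plain margin ranking loss stands in for the paper's full rank-aware loss, which adds further terms to push the two branches toward higher- and lower-skill video parts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttentionBranch(nn.Module):
    """Scores each video segment, then pools segment features using the
    softmax-normalised scores (standard temporal attention)."""

    def __init__(self, feat_dim):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # per-segment relevance score

    def forward(self, segments):  # segments: (num_segments, feat_dim)
        weights = F.softmax(self.scorer(segments), dim=0)  # (num_segments, 1)
        return (weights * segments).sum(dim=0)             # (feat_dim,)


class RankAwareSkillModel(nn.Module):
    """Two attention branches -- one intended to latch onto evidence of
    higher skill (pros), one onto lower skill (cons) -- whose pooled
    features feed a linear head producing a scalar skill score."""

    def __init__(self, feat_dim):
        super().__init__()
        self.pros_attention = TemporalAttentionBranch(feat_dim)
        self.cons_attention = TemporalAttentionBranch(feat_dim)
        self.rank_head = nn.Linear(2 * feat_dim, 1)

    def forward(self, segments):
        pooled = torch.cat([self.pros_attention(segments),
                            self.cons_attention(segments)])
        return self.rank_head(pooled)  # skill score, shape (1,)


def pairwise_ranking_loss(score_better, score_worse, margin=1.0):
    """Margin ranking loss on an annotated pair: the video labelled as
    showing more skill should outscore the other by at least `margin`."""
    return F.relu(margin - (score_better - score_worse))


# Example: score one annotated pair of videos, each given as a sequence
# of precomputed segment features (e.g. from an I3D backbone).
model = RankAwareSkillModel(feat_dim=1024)
video_better = torch.randn(120, 1024)  # 120 segments of the higher-skill video
video_worse = torch.randn(95, 1024)    # 95 segments of the lower-skill video
loss = pairwise_ranking_loss(model(video_better), model(video_worse))
loss.backward()
```

Because supervision is only the video-level pairwise label, the attention weights are learned indirectly: whatever segment weighting best explains the ranking of pairs is what the branches converge to, which is how the model localises skill-relevant parts without temporal annotation.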
