Who's Better, Who's Best: Skill Determination in Video using Deep Ranking

This paper presents a method for assessing skill from video, applicable to a variety of tasks ranging from surgery to drawing and rolling dough. We formulate the problem as pairwise and overall ranking of video collections, and propose a supervised deep ranking model that learns discriminative features between pairs of videos exhibiting different amounts of skill. We utilise a two-stream Temporal Segment Network to capture both the type and quality of motions and the evolving task state. Results demonstrate that our method is applicable to a variety of tasks, with the percentage of correctly ordered pairs of videos ranging from 70% to 82% across four datasets. We demonstrate the robustness of our approach via sensitivity analysis of its parameters. We see this work as an effort toward the automated and objective organisation of how-to videos and, more broadly, generic skill determination in video.
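
The pairwise formulation and the reported metric (percentage of correctly ordered pairs) can be illustrated with a minimal sketch. The abstract does not specify the loss, so the hinge-style margin ranking loss below is an assumption, as are the function names, the margin value, and the toy scores; this is not the authors' implementation. Each video is assumed to have already been mapped to a scalar skill score by some model.

```python
from typing import List, Sequence, Tuple


def margin_ranking_loss(score_better: float, score_worse: float,
                        margin: float = 1.0) -> float:
    """Hinge-style pairwise loss (assumed form, not taken from the paper).

    It is zero once the more skilled video outscores the less skilled
    one by at least `margin`, and grows linearly otherwise.
    """
    return max(0.0, margin - (score_better - score_worse))


def pairwise_accuracy(scores: Sequence[float],
                      pairs: Sequence[Tuple[int, int]]) -> float:
    """Fraction of annotated pairs that the scores order correctly.

    Each pair is (index_of_better_video, index_of_worse_video).
    """
    correct = sum(1 for better, worse in pairs if scores[better] > scores[worse])
    return correct / len(pairs)


def overall_ranking(scores: Sequence[float]) -> List[int]:
    """Video indices sorted from highest to lowest predicted skill."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy example: three videos with hypothetical predicted skill scores.
scores = [0.9, 0.2, 0.5]
# Hypothetical annotations; the last pair disagrees with the scores.
pairs = [(0, 1), (0, 2), (2, 1), (1, 2)]

accuracy = pairwise_accuracy(scores, pairs)
ranking = overall_ranking(scores)
```

In this sketch the overall ranking follows directly from sorting the per-video scores, so a model trained only on pairs still yields an ordering of the whole collection.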
