Paper information - Who's Better, Who's Best: Skill Determination in Video using Deep Ranking

This paper presents a method for assessing skill from video across a variety of tasks, ranging from drawing to surgery and rolling dough. We formulate the problem as pairwise and overall ranking of video collections, and propose a supervised deep ranking model that learns discriminative features between pairs of videos exhibiting different amounts of skill. We utilise a two-stream Temporal Segment Network to capture both the type and quality of motions and the evolving task state. Results demonstrate that our method is applicable to a variety of tasks, with the percentage of correctly ordered pairs of videos ranging from 70% to 82% across four datasets. We demonstrate the robustness of our approach via a sensitivity analysis of its parameters. We see this work as an effort toward the automated and objective organisation of how-to videos and, more broadly, toward generic skill determination in video.
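The pairwise formulation and the reported metric (percentage of correctly ordered pairs) can be illustrated with a minimal sketch. It assumes a standard margin ranking loss, which the abstract does not specify, and it replaces the two-stream network with hand-picked scalar skill scores. All function names, video identifiers, scores, and the margin value are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def margin_ranking_loss(score_better: float, score_worse: float,
                        margin: float = 1.0) -> float:
    """Hinge-style loss for one annotated pair (assumed form, not taken from the paper).

    The loss is zero once the more skilled video outscores the less
    skilled one by at least `margin`.
    """
    return max(0.0, margin - (score_better - score_worse))


def pairwise_accuracy(scores: Dict[str, float],
                      pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (better, worse) pairs that the scores order correctly."""
    if not pairs:
        raise ValueError("need at least one annotated pair")
    correct = sum(1 for better, worse in pairs if scores[better] > scores[worse])
    return correct / len(pairs)


def overall_ranking(scores: Dict[str, float]) -> List[str]:
    """Order videos from highest to lowest predicted skill score."""
    return sorted(scores, key=lambda video: scores[video], reverse=True)


# Hypothetical scores standing in for the output of a trained model.
scores = {"video_a": 2.1, "video_b": 0.4, "video_c": 1.3, "video_d": 1.0}

# Hypothetical annotations: each tuple is (more skilled, less skilled).
pairs = [
    ("video_a", "video_b"),
    ("video_a", "video_c"),
    ("video_c", "video_b"),
    ("video_d", "video_c"),  # the scores above get this pair wrong
]

accuracy = pairwise_accuracy(scores, pairs)
ranking = overall_ranking(scores)
losses = [margin_ranking_loss(scores[b], scores[w]) for b, w in pairs]

print(f"pairwise accuracy: {accuracy:.2f}")
print("overall ranking:", ranking)
print("per-pair losses:", [round(loss, 3) for loss in losses])
```

In this toy setting, three of the four annotated pairs are ordered correctly, so the pairwise accuracy is 0.75. The misordered pair incurs the largest loss, which is the signal a ranking model would train on.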
