Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination

This paper presents a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate the problem as pairwise (who's better?) and overall (who's best?) ranking of video collections, using supervised deep ranking. We propose a novel loss function that learns discriminative features when a pair of videos exhibit variance in skill, and learns shared features when a pair of videos exhibit comparable skill levels. Results demonstrate our method is applicable across tasks, with the percentage of correctly ordered pairs of videos ranging from 70% to 83% for four datasets. We demonstrate the robustness of our approach via sensitivity analysis of its parameters. We see this work as effort toward the automated organization of how-to video collections and overall, generic skill determination in video.

[1]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[2]  Yachna Sharma,et al.  Automated video-based assessment of surgical skills for training and evaluation in medical schools , 2016, International Journal of Computer Assisted Radiology and Surgery.

[3]  Ivan Laptev,et al.  Joint Discovery of Object States and Manipulation Actions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Patrick Olivier,et al.  Automated surgical OSATS prediction from videos , 2014, 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI).

[5]  Irfan A. Essa,et al.  Automated Assessment of Surgical Skills Using Frequency Analysis , 2015, MICCAI.

[6]  Irfan A. Essa,et al.  Automated surgical skill assessment in RMIS training , 2017, International Journal of Computer Assisted Radiology and Surgery.

[7]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[8]  Brendan Tran Morris,et al.  Learning to Score Olympic Events , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[9]  Yong Man Ro,et al.  EvaluationNet: Can Human Skill be Evaluated by Deep Networks? , 2017, ArXiv.

[10]  Baoxin Li,et al.  Relative Hidden Markov Models for Video-Based Evaluation of Motion Skills in Surgical Training , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[12]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[13]  Stanislav Kovacic,et al.  Trajectory Based Assessment of Coordinated Human Activity , 2003, ICVS.

[14]  Tao Mei,et al.  Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Irfan A. Essa,et al.  Video and accelerometer-based motion analysis for automated surgical skills assessment , 2017, International Journal of Computer Assisted Radiology and Surgery.

[16]  Yang Song,et al.  Learning Fine-Grained Image Similarity with Deep Ranking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[18]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[19]  Zhe L. Lin,et al.  Top-Down Neural Attention by Excitation Backprop , 2016, International Journal of Computer Vision.

[20]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Rebecca MacDonell-Yilmaz,et al.  What You Do. , 2019, The American journal of nursing.

[22]  Gregory D. Hager,et al.  A study of crowdsourced segment-level surgical skill assessment using pairwise rankings , 2015, International Journal of Computer Assisted Radiology and Surgery.

[23]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[24]  Gregory N. Hullender,et al.  Learning to rank using gradient descent , 2005, ICML.

[25]  Baoxin Li,et al.  Video-based motion expertise analysis in simulation-based surgical training using hierarchical dirichlet process hidden markov model , 2011, MMAR '11.

[26]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[27]  Irfan Essa,et al.  Video Based Assessment of OSATS Using Sequential Motion Textures , 2014 .

[28]  Martin A. Giese,et al.  Estimation of Skill Levels in Sports Based on Hierarchical Spatio-Temporal Correspondences , 2003, DAGM-Symposium.

[29]  Stefan Wermter,et al.  Human motion assessment in real time using recurrent self-organization , 2016, 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN).

[30]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[31]  Gregory D. Hager,et al.  A Dataset and Benchmarks for Segmentation and Recognition of Gestures in Robotic Surgery , 2017, IEEE Transactions on Biomedical Engineering.

[32]  Dima Damen,et al.  You-Do, I-Learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance , 2016, Comput. Vis. Image Underst..

[33]  Gregory D. Hager,et al.  Pairwise Comparison-Based Objective Score for Automated Skill Assessment of Segments in a Surgical Task , 2014, IPCAI.

[34]  Antonio Torralba,et al.  Assessing the Quality of Actions , 2014, ECCV.

[35]  Christopher J. C. Burges,et al.  From RankNet to LambdaRank to LambdaMART: An Overview , 2010 .

[36]  Bülent Sankur,et al.  Graph-based analysis of physical exercise actions , 2013, MIIRH '13.

[37]  Jessica K. Hodgins,et al.  Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database , 2008 .

[38]  Jaana Kekäläinen,et al.  IR evaluation methods for retrieving highly relevant documents , 2000, SIGIR '00.

[39]  Jianbo Shi,et al.  Am I a Baller? Basketball Performance Assessment from First-Person Videos , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Ivan Laptev,et al.  Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).