Sensor substitution for video-based action recognition

There are many applications in which domain-specific sensing, such as accelerometry, kinematics, or force sensing, provides unique and important information for the control or analysis of motion. However, these sensors often cannot be deployed or accessed outside laboratory environments. For example, humans or robots can be instrumented to measure motion in the laboratory in ways that cannot be replicated in the wild. An alternative, which we explore in this paper, is to address situations where accurate sensing is available while training an algorithm, but only video is available at deployment. We present two examples of this sensor substitution methodology. The first trains a convolutional neural network to regress real-valued signals, including robot end-effector pose, from video. The second regresses binary signals, derived from accelerometer data, that indicate when specific objects are in motion. We evaluate these approaches on the JIGSAWS dataset for robotic surgical training assessment and on the 50 Salads dataset for modeling complex, structured cooking tasks. In both cases we show that, for video-based action recognition, the trained models provide information comparable to the sensory signals they replace.
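As a rough illustration of this idea, the sketch below shows one way to train a per-frame regressor that substitutes for a missing sensor. It is written in PyTorch purely as an assumption on our part; the paper does not commit to a framework, and the architecture, names, and loss choices here are hypothetical placeholders rather than the authors' actual model.

```python
# Minimal sketch of sensor substitution: a CNN is trained on (frame, sensor)
# pairs so that, at deployment, the sensor channel can be replaced by the
# network's prediction. All details below are illustrative assumptions.

import torch
import torch.nn as nn


class SensorRegressionCNN(nn.Module):
    """Regresses a sensor vector (e.g., end-effector pose) from a video frame."""

    def __init__(self, out_dim: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool to a fixed-size feature vector
        )
        self.head = nn.Linear(128, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames).flatten(1))


def train_step(model, optimizer, frames, targets, binary: bool) -> float:
    """One supervised step. Real-valued targets (pose) use an L2 loss here;
    binary "object in motion" targets use a logistic loss instead."""
    optimizer.zero_grad()
    pred = model(frames)
    if binary:
        loss = nn.functional.binary_cross_entropy_with_logits(pred, targets)
    else:
        loss = nn.functional.mse_loss(pred, targets)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy example: regress a 6-DoF pose vector from random 3x120x160 frames.
    model = SensorRegressionCNN(out_dim=6)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    frames = torch.randn(8, 3, 120, 160)
    poses = torch.randn(8, 6)
    print(train_step(model, opt, frames, poses, binary=False))
```

At deployment, the learned regressor stands in for the absent sensor: its per-frame outputs are fed to the downstream action-recognition model in place of the original signals.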
