Saliency Driven Object recognition in egocentric videos with deep CNN: toward application in assistance to Neuroprostheses

The problem of object recognition in natural scenes has recently been addressed successfully with Deep Convolutional Neural Networks (CNNs), yielding a significant breakthrough in recognition scores. The computational efficiency of Deep CNNs, which depends on their depth, allows for their use in real-time applications. One of the key issues here is to reduce the number of windows selected from images to be submitted to a Deep CNN. This is usually solved by preliminary segmentation and selection of specific windows with high "objectness" or other indicators of likely object locations. In this paper we propose a Deep CNN approach and a general framework for recognizing objects in real time from an egocentric perspective. Here the window of interest is built on the basis of a visual attention map computed over gaze fixations measured by a glasses-worn eye-tracker. The target application of this setup is an interactive, user-friendly environment for upper-limb amputees. Vision has to help the subject control the worn neuroprosthesis when only few muscles remain and EMG control becomes inefficient. The recognition results on a specifically recorded corpus of 151 videos with simple geometrical objects show a mAP of 64.6%, with a computational time at generalization lower than the duration of a visual fixation on the object of interest.
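As a rough illustration of the window-selection step, the sketch below (Python/NumPy; the function names, the Gaussian spread sigma, the peak-centred cropping and the 227x227 window size are our own assumptions, the latter being the standard AlexNet input size) builds a Wooding-style fixation map from the eye-tracker gaze points and crops a window of interest around its maximum. The resulting crop would then be submitted to the Deep CNN classifier; this is a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np

def fixation_attention_map(fixations, frame_shape, sigma=30.0):
    """Wooding-style attention map: a sum of isotropic Gaussians
    centred on the gaze fixations measured by the glasses-worn eye-tracker."""
    h, w = frame_shape
    ys, xs = np.mgrid[0:h, 0:w]
    amap = np.zeros((h, w), dtype=np.float32)
    for fx, fy in fixations:            # fixation coordinates in pixels (x, y)
        amap += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2.0 * sigma ** 2))
    return amap / (amap.max() + 1e-8)   # normalise to [0, 1]

def window_of_interest(frame, attention_map, size=227):
    """Crop a fixed-size window centred on the attention peak, clamped to the
    frame borders. Assumes the frame is at least size x size pixels."""
    h, w = attention_map.shape
    cy, cx = np.unravel_index(np.argmax(attention_map), attention_map.shape)
    x0 = int(np.clip(cx - size // 2, 0, w - size))
    y0 = int(np.clip(cy - size // 2, 0, h - size))
    return frame[y0:y0 + size, x0:x0 + size]
```

In use, the single cropped window per frame replaces the hundreds of region proposals of segmentation-based pipelines, which is what keeps the classification time below the duration of a visual fixation.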
