Semi-Automatic Annotation with Predicted Visual Saliency Maps for Object Recognition in Wearable Video

Recognition of objects of a given category in visual content is a key problem in computer vision and multimedia, and is strongly needed in wearable video shooting for a wide range of socially important applications. Supervised learning approaches have proved the most efficient for this task, but they require available ground truth for training models. This is specifically true for Deep Convolutional Networks, but it also holds for other popular models such as SVMs on visual signatures. Annotating ground truth by drawing bounding boxes (BB) is a very tedious task requiring substantial human effort. Research on the prediction of visual attention in images and videos has reached maturity, specifically as concerns bottom-up visual attention modeling. Hence, instead of annotating the ground truth manually with BBs, we propose to use automatically predicted salient areas as object locators for annotation. Such saliency prediction is not perfect, however, so active contour models are applied to the saliency maps in order to isolate the most prominent areas covering the objects. The approach is tested in the framework of a well-studied supervised learning model: an SVM with a psycho-visually weighted Bag-of-Words. The egocentric GTEA dataset was used in the experiment. The difference in mAP (mean average precision) is less than 10 percent, while the mean annotation time is 36% lower.
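The core idea above is to convert a predicted saliency map into a bounding-box annotation automatically. The paper isolates the salient region with Chan-Vese active contours; the minimal sketch below substitutes a simple mean-value threshold for that segmentation step, just to show the map-to-BB conversion. All function and variable names here are illustrative, not taken from the authors' code.

```python
# Sketch: derive a bounding-box annotation from a predicted saliency map.
# Assumption: the active-contour segmentation of the paper is replaced by
# a naive above-mean threshold, purely for illustration.

def saliency_to_bbox(saliency):
    """Return (x_min, y_min, x_max, y_max) enclosing above-mean saliency
    pixels, or None if no pixel exceeds the mean."""
    flat = [v for row in saliency for v in row]
    thresh = sum(flat) / len(flat)
    xs, ys = [], []
    for y, row in enumerate(saliency):
        for x, v in enumerate(row):
            if v > thresh:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

# A 5x5 toy map with a bright 2x2 blob in the centre:
smap = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.8, 0.0],
    [0.0, 0.0, 0.7, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(saliency_to_bbox(smap))  # → (2, 2, 3, 3)
```

In the paper's pipeline the resulting box serves as an automatic annotation that replaces a manually drawn one; a real implementation would use a proper segmentation such as Chan-Vese on the saliency map rather than a global threshold.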