Adaptive Pooling in Multi-instance Learning for Web Video Annotation

Web videos are usually weakly annotated, i.e., a tag is associated to a video once the corresponding concept appears in a frame of this video without indicating when and where it occurs. These weakly annotated tags pose big troubles to many Web video applications, e.g. search and recommendation. In this paper, we present a new Web video annotation approach based on multi-instance learning (MIL) with a learnable pooling function. By formulating the Web video annotation as a MIL problem, we present an end-to-end deep network framework to solve this problem in which the frame (instance) level annotation is estimated from tags given at the video (bag of instances) level via a convolutional neural network (CNN). A learnable pooling function is proposed to adaptively fuse the outputs of the CNN to determine tags at the video level. We further propose a new loss function that consists of both bag-level and instance-level losses, which enables the penalty term to be aware of the internal state of network rather than only an overall loss, thus makes the pooling function learned better and faster. Experimental results demonstrate that our proposed framework is able to not only enhance the accuracy of Web video annotation by outperforming the state-of-the-art Web video annotation methods on the large-scale video dataset FCVID, but also help to infer the most relevant frames in Web videos.

[1]  James D. Keeler,et al.  Integrated Segmentation and Recognition of Hand-Printed Numerals , 1990, NIPS.

[2]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[3]  Jan Ramon,et al.  Multi instance neural networks , 2000, ICML 2000.

[4]  Yixin Chen,et al.  Image Categorization by Learning and Reasoning with Regions , 2004, J. Mach. Learn. Res..

[5]  Paul A. Viola,et al.  Multiple Instance Boosting for Object Detection , 2005, NIPS.

[6]  Cees G. M. Snoek,et al.  Early versus late fusion in semantic video analysis , 2005, MULTIMEDIA '05.

[7]  Rong Yan,et al.  Multiple instance learning for labeling faces in broadcasting news video , 2005, MULTIMEDIA '05.

[8]  Tao Mei,et al.  Correlative multi-label video annotation , 2007, ACM Multimedia.

[9]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Meng Wang,et al.  Beyond Distance Measurement: Constructing Neighborhood Similarity for Video Annotation , 2009, IEEE Transactions on Multimedia.

[11]  Jean Ponce,et al.  A Theoretical Analysis of Feature Pooling in Visual Recognition , 2010, ICML.

[12]  Yongdong Zhang,et al.  Web video retagging , 2011, Multimedia Tools and Applications.

[13]  Yongdong Zhang,et al.  Web video categorization based on Wikipedia categories and content-duplicated open resources , 2010, ACM Multimedia.

[14]  Alexander Zien,et al.  lp-Norm Multiple Kernel Learning , 2011, J. Mach. Learn. Res..

[15]  Zi Huang,et al.  Multiple feature hashing for real-time large scale near-duplicate video retrieval , 2011, ACM Multimedia.

[16]  Hsin-Min Wang,et al.  Automatic annotation of Web videos , 2011, 2011 IEEE International Conference on Multimedia and Expo.

[17]  M. Kloft,et al.  l p -Norm Multiple Kernel Learning , 2011 .

[18]  Nicolas Le Roux,et al.  Ask the locals: Multi-way local pooling for image recognition , 2011, 2011 International Conference on Computer Vision.

[19]  Chong-Wah Ngo,et al.  Fast Semantic Diffusion for Large-Scale Context-Based Image and Video Annotation , 2012, IEEE Transactions on Image Processing.

[20]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[21]  Trevor Darrell,et al.  Beyond spatial pyramids: Receptive field learning for pooled image features , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Meng Wang,et al.  Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification , 2012, IEEE Transactions on Multimedia.

[23]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[24]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[25]  Rob Fergus,et al.  Stochastic Pooling for Regularization of Deep Convolutional Neural Networks , 2013, ICLR.

[26]  Mario Fritz,et al.  Learning Smooth Pooling Regions for Visual Recognition , 2013, BMVC.

[27]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Chong-Wah Ngo,et al.  Annotation for free: video tagging by mining user search behavior , 2013, ACM Multimedia.

[29]  Yoshua Bengio,et al.  Maxout Networks , 2013, ICML.

[30]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[31]  Yi Yang,et al.  How Related Exemplars Help Complex Event Detection in Web Videos? , 2013, 2013 IEEE International Conference on Computer Vision.

[32]  Yong Jae Lee,et al.  Weakly-supervised Discovery of Visual Pattern Configurations , 2014, NIPS.

[33]  Razvan Pascanu,et al.  Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks , 2013, ECML/PKDD.

[34]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[35]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[36]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Iasonas Kokkinos,et al.  Untangling Local and Global Deformations in Deep Convolutional Networks for Image Classification and Sliding Window Detection , 2014, ArXiv.

[38]  Yue Gao,et al.  Exploiting Web Images for Semantic Video Indexing Via Robust Sample-Specific Loss , 2014, IEEE Transactions on Multimedia.

[39]  汪萌,et al.  Image Annotation By Multiple-Instance Learning With Discriminative Feature Mapping and Selection , 2014 .

[40]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[42]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[43]  G. K. Tam,et al.  Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams , 2015 .

[44]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Ronan Collobert,et al.  From image-level to pixel-level labeling with Convolutional Networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[47]  Ivan Laptev,et al.  Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Trevor Darrell,et al.  Fully Convolutional Multi-Class Multiple Instance Learning , 2014, ICLR.

[50]  Zhuowen Tu,et al.  Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree , 2015, AISTATS.

[51]  Ali Farhadi,et al.  Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Brendan J. Frey,et al.  Classifying and segmenting microscopy images with deep multiple instance learning , 2015, Bioinform..

[53]  Yu-Gang Jiang,et al.  Harnessing Object and Scene Semantics for Large-Scale Video Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[55]  Yi Yang,et al.  Semantic Pooling for Complex Event Analysis in Untrimmed Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Shih-Fu Chang,et al.  Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Shih-Fu Chang,et al.  Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification , 2017, IEEE Transactions on Multimedia.

[58]  Cees Snoek,et al.  VideoLSTM convolves, attends and flows for action recognition , 2016, Comput. Vis. Image Underst..