PhD thesis: objects for spatio-temporal activity recognition in videos

This thesis investigates the role of objects for the spatio-temporal recognition of activities in videos. More specifically, we investigate what, when, and where specific activities occur in visual content by examining object representations, centered around the main question: what do objects tell about the extent of activities in visual space and time? The thesis presents six works on this topic. First, the spatial extent of activities is investigated using objects and their parts. Second, over two works, it is investigated whether activities exhibit different object preferences over time and which objects matter for representing activities. Third, the full spatio-temporal extent of activities is investigated, where over three works the extensive annotation burden of action localization is replaced respectively with point annotations, pseudo-annotations, and a zero-shot setting, where no video examples are given during training. The works lead to the conclusion that objects provide valuable information about the presence and spatio-temporal extent of activities in videos.

[1]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  In-So Kweon,et al.  Multi-scale pyramid pooling for deep convolutional representation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[4]  Yunde Jia,et al.  Discriminatively Trained And-Or Tree Models for Object Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Shaogang Gong,et al.  Semantic embedding space for zero-shot action recognition , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[7]  Shaogang Gong,et al.  Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation , 2014, ECCV.

[8]  Cordelia Schmid,et al.  The LEAR submission at Thumos 2014 , 2014 .

[9]  Nanning Zheng,et al.  Video Object Discovery and Co-Segmentation with Extremely Weak Supervision , 2017, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  Cees Snoek,et al.  VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events , 2014, ACM Multimedia.

[11]  CioccaGianluigi,et al.  An interactive tool for manual, semi-automatic and automatic video annotation , 2015 .

[12]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[13]  Chenliang Xu,et al.  Can humans fly? Action understanding with multiple classes of actors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ramakant Nevatia,et al.  ISOMER: Informative Segment Observations for Multimedia Event Recounting , 2014, ICMR.

[15]  Tianbao Yang,et al.  Learning Attributes Equals Multi-Source Domain Generalization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[17]  Theo Gevers,et al.  Evaluation of Color Spatio-Temporal Interest Points for Human Action Recognition , 2014, IEEE Transactions on Image Processing.

[18]  Alexei A. Efros,et al.  What makes Paris look like Paris? , 2015, Commun. ACM.

[19]  Rama Chellappa,et al.  Machine Recognition of Human Activities: A Survey , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[20]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[21]  Yi Yang,et al.  Fast and Accurate Content-based Semantic Search in 100M Internet Videos , 2015, ACM Multimedia.

[22]  Sangmin Oh,et al.  Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[24]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[25]  Nuno Vasconcelos,et al.  Multiclass Boosting: Theory and Algorithms , 2011, NIPS.

[26]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[27]  Haroon Idrees,et al.  Action Localization in Videos through Context Walk , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Yu-Gang Jiang,et al.  Harnessing Object and Scene Semantics for Large-Scale Video Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[30]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Ramakant Nevatia,et al.  DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[33]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Carsten Rother,et al.  Weakly supervised discriminative localization and classification: a joint learning process , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[35]  Lluís Màrquez i Villodre,et al.  Boosting Applied to Word Sense Disambiguation , 2000, ArXiv.

[36]  Cees Snoek,et al.  Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Cees Snoek,et al.  COSTA: Co-Occurrence Statistics for Zero-Shot Classification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Deli Zhao,et al.  Recognizing an Action Using Its Name: A Knowledge-Based Approach , 2016, International Journal of Computer Vision.

[40]  David Mihalcik,et al.  The Design and Implementation of ViPER , 2005 .

[41]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Xun Xu,et al.  Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation , 2016, ECCV.

[44]  Cees Snoek,et al.  No spare parts: Sharing part detectors for image categorization , 2015, Comput. Vis. Image Underst..

[45]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[46]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Zhiqiang Shen,et al.  Multiple Granularity Descriptors for Fine-Grained Categorization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Ramakant Nevatia,et al.  Evaluating multimedia features and fusion for example-based event detection , 2013, Machine Vision and Applications.

[49]  Alexei A. Efros,et al.  An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Koen E. A. van de Sande,et al.  All vehicles are cars: subclass preferences in container concepts , 2012, ICMR '12.

[51]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[52]  Gang Hua,et al.  Semantic Model Vectors for Complex Video Event Recognition , 2012, IEEE Transactions on Multimedia.

[53]  James M. Rehg,et al.  A Scalable Approach to Activity Recognition based on Object Use , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[54]  Larry S. Davis,et al.  Objects in Action: An Approach for Combining Action Understanding and Object Perception , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[56]  Stefan Carlsson,et al.  Spotlight the Negatives: A Generalized Discriminative Latent Model , 2015, BMVC.

[57]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[58]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[59]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[60]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[61]  Philip H. S. Torr,et al.  BING: Binarized normed gradients for objectness estimation at 300fps , 2014, Computational Visual Media.

[62]  A. Needham Object recognition and object segregation in 4.5-month-old infants. , 2001, Journal of experimental child psychology.

[63]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[64]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[65]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[66]  Tao Xiang,et al.  In Defence of Negative Mining for Annotating Weakly Labelled Data , 2012, ECCV.

[67]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Cees Snoek,et al.  Objects2action: Classifying and Localizing Actions without Any Video Example , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[69]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[70]  Dong Han,et al.  Selection and context for action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[71]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[72]  Mahmood Fathy,et al.  Multi-label Discriminative Weakly-Supervised Human Activity Recognition and Localization , 2014, ACCV.

[73]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[74]  Gang Yu,et al.  Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Koichi Shinoda,et al.  Adaptation of Word Vectors using Tree Structure for Visual Semantics , 2016, ACM Multimedia.

[76]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[77]  Yi Yang,et al.  Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision , 2015, ACM Multimedia.

[78]  Mubarak Shah,et al.  What If We Do Not have Multiple Videos of the Same Action? — Video Action Localization Using Web Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Antonio Torralba,et al.  Sharing Visual Features for Multiclass and Multiview Object Detection , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[80]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[81]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[82]  Vladlen Koltun,et al.  Geodesic Object Proposals , 2014, ECCV.

[83]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[84]  Cordelia Schmid,et al.  Explicit Modeling of Human-Object Interactions in Realistic Videos , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[85]  Yao Li,et al.  Mid-level deep pattern mining , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[86]  Ran Xu,et al.  Human action segmentation with hierarchical supervoxel consistency , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[87]  Anuj Srivastava,et al.  Action Recognition Using Rate-Invariant Analysis of Skeletal Shape Trajectories , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[88]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[89]  Chunheng Wang,et al.  Robust relative attributes for human action recognition , 2013, Pattern Analysis and Applications.

[90]  Cees Snoek,et al.  Spot On: Action Localization from Pointly-Supervised Proposals , 2016, ECCV.

[91]  Dennis Koelma,et al.  The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection , 2016, ICMR.

[92]  Jonathan Krause,et al.  Fine-grained recognition without part annotations , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[93]  Michael Felsberg,et al.  Coloring Action Recognition in Still Images , 2013, International Journal of Computer Vision.

[94]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[95]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[96]  Jitendra Malik,et al.  Discriminative Decorrelation for Clustering and Classification , 2012, ECCV.

[97]  Florian Metze,et al.  Beyond audio and video retrieval: towards multimedia summarization , 2012, ICMR.

[98]  Ivan Laptev,et al.  Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[99]  Xun Xu,et al.  Transductive Zero-Shot Action Recognition by Word-Vector Embedding , 2015, International Journal of Computer Vision.

[100]  Ali Farhadi,et al.  Recognition using visual phrases , 2011, CVPR 2011.

[101]  Günther Eibl,et al.  Multiclass Boosting for Weak Classifiers , 2005, J. Mach. Learn. Res..

[102]  Derek Hoiem,et al.  Learning Collections of Part Models for Object Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[103]  Svetlana Lazebnik,et al.  Multi-scale Orderless Pooling of Deep Convolutional Activation Features , 2014, ECCV.

[104]  Vittorio Ferrari,et al.  Object localization in ImageNet by looking out of the window , 2015, BMVC.

[105]  Ivan Laptev,et al.  Improving bag-of-features action recognition with non-local cues , 2010, BMVC.

[106]  Jiebo Luo,et al.  Mining compositional features for boosting , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[107]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[108]  Yi Yang,et al.  Concepts Not Alone: Exploring Pairwise Relationships for Zero-Shot Video Activity Recognition , 2016, AAAI.

[109]  Amir Roshan Zamir,et al.  Action Recognition in Realistic Sports Videos , 2014 .

[110]  Masoud Mazloom,et al.  Querying for video events by semantic signatures from few examples , 2013, MM '13.

[111]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[112]  Hui Cheng,et al.  Video event recognition using concept attributes , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[113]  Thomas G. Dietterich An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization , 2000, Machine Learning.

[114]  Gang Wang,et al.  Exemplar based Deep Discriminative and Shareable Feature Learning for scene image classification , 2015, Pattern Recognit..

[115]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[116]  Shuang Wu,et al.  Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[117]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[118]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[119]  Qiang Chen,et al.  Contextualizing Object Detection and Classification , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[120]  Jia Xu,et al.  Learning to segment under various forms of weak supervision , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[121]  Tinne Tuytelaars,et al.  Weakly supervised object detection with convex clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[122]  Martial Hebert,et al.  Watch and learn: Semi-supervised learning of object detectors from videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[123]  Fei-Fei Li,et al.  Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[124]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[125]  Shaogang Gong,et al.  Unsupervised Domain Adaptation for Zero-Shot Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[126]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[127]  Masoud Mazloom,et al.  Conceptlets: Selective Semantics for Classifying Video Events , 2014, IEEE Transactions on Multimedia.

[128]  Nicu Sebe,et al.  Unsupervised Tube Extraction Using Transductive Learning and Dense Trajectories , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[129]  Cees Snoek,et al.  Video Stream Retrieval of Unseen Queries using Semantic Memory , 2016, BMVC.

[130]  Deva Ramanan,et al.  Video Annotation and Tracking with Active Learning , 2011, NIPS.

[131]  Jeffrey M. Zacks,et al.  Understanding events : from perception to action , 2008 .

[132]  Ivan Laptev,et al.  Recognizing human actions in still images: a study of bag-of-features and part-based representations , 2010, BMVC.

[133]  Antonio Torralba,et al.  Unsupervised Detection of Regions of Interest Using Iterative Link Analysis , 2009, NIPS.

[134]  Baoxin Li,et al.  Recognizing unseen actions in a domain-adapted embedding space , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[135]  Shih-Fu Chang,et al.  Localizing Actions from Video Labels and Pseudo-Annotations , 2017, BMVC.

[136]  Xiaogang Wang,et al.  Object Detection from Video Tubelets with Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[137]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[138]  Marcel Simon,et al.  Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[139]  Subhransu Maji,et al.  Bilinear CNN Models for Fine-Grained Visual Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[140]  Jean Ponce,et al.  Unsupervised Object Discovery and Tracking in Video Collections , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[141]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[142]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[143]  Jean Ponce,et al.  Learning Discriminative Part Detectors for Image Classification and Cosegmentation , 2013, 2013 IEEE International Conference on Computer Vision.

[144]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[145]  Ling Shao,et al.  Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach , 2016, IEEE Transactions on Cybernetics.

[146]  Shih-Fu Chang,et al.  Minimally Needed Evidence for Complex Event Recognition in Unconstrained Videos , 2014, ICMR.

[147]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[148]  Hui Cheng,et al.  Multimedia event recounting with concept based representation , 2012, ACM Multimedia.

[149]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[150]  Fei-Fei Li,et al.  What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[151]  Li Fei-Fei,et al.  Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses , 2011 .

[152]  Zicheng Liu,et al.  Cross-dataset action detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[153]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[154]  Joachim Denzler,et al.  Exemplar-Specific Patch Features for Fine-Grained Recognition , 2014, GCPR.

[155]  Bo Zhang,et al.  A Formal Study of Shot Boundary Detection , 2007, IEEE Transactions on Circuits and Systems for Video Technology.

[156]  Juan Carlos Niebles,et al.  Spatio-temporal Human-Object Interactions for Action Recognition in Videos , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[157]  Wei Chen,et al.  Action Detection by Implicit Intentional Motion Clustering , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[158]  Cees Snoek,et al.  APT: Action localization proposals from dense trajectories , 2015, BMVC.

[159]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[160]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.

[161]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[162]  Cordelia Schmid,et al.  Discriminative spatial saliency for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[163]  Cordelia Schmid,et al.  Combining efficient object localization and image classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[164]  Matthieu Guillaumin,et al.  Food-101 - Mining Discriminative Components with Random Forests , 2014, ECCV.

[165]  Renée Baillargeon,et al.  Event categorization in infancy , 2002, Trends in Cognitive Sciences.

[166]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[167]  Cordelia Schmid,et al.  Expanded Parts Model for Human Attribute and Action Recognition in Still Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[168]  Subhransu Maji,et al.  Deep filter banks for texture recognition and segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[169]  Cordelia Schmid,et al.  Multi-fold MIL Training for Weakly Supervised Object Localization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[170]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[171]  Cordelia Schmid,et al.  Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[172]  Tao Xiang,et al.  Weakly Supervised Action Detection , 2011, BMVC.

[173]  Gang Wang,et al.  Learning Discriminative and Shareable Features for Scene Classification , 2014, ECCV.

[174]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[175]  Cees Snoek,et al.  What do 15,000 object categories tell us about classifying and localizing actions? , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[176]  Martial Hebert,et al.  Efficient visual event detection using volumetric features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[177]  Cees Snoek,et al.  Bag-of-Fragments: Selecting and Encoding Video Fragments for Event Detection and Recounting , 2015, ICMR.

[178]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[179]  Andrew Zisserman,et al.  Automatic Discovery and Optimization of Parts for Image Classification , 2015, ICLR.

[180]  Silvio Savarese,et al.  What are they doing? : Collective activity classification using spatio-temporal relationship among people , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[181]  Weiyu Zhang,et al.  From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[182]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.

[183]  Suman Saha,et al.  Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos , 2016, BMVC.

[184]  Chenliang Xu,et al.  Actor-Action Semantic Segmentation with Grouping Process Models , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[185]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[186]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[187]  Bernt Schiele,et al.  How good are detection proposals, really? , 2014, BMVC.

[188]  Frank Dellaert,et al.  Dataset fingerprints: Exploring image collections through data mining , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[189]  Antonio Torralba,et al.  LabelMe video: Building a video database with human annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[190]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[191]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[192]  Jianhua Lin,et al.  Divergence measures based on the Shannon entropy , 1991, IEEE Trans. Inf. Theory.

[193]  Koen E. A. van de Sande,et al.  Recommendations for video event recognition using concept vocabularies , 2013, ICMR.

[194]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[195]  Wei Liu,et al.  Multimedia classification and event detection using double fusion , 2013, Multimedia Tools and Applications.

[196]  Cordelia Schmid,et al.  Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[197]  L. Itti,et al.  Quantifying center bias of observers in free viewing of dynamic natural scenes. , 2009, Journal of vision.

[198]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[199]  Irfan A. Essa,et al.  Exploiting human actions and object context for recognition tasks , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[200]  Yi Yang,et al.  DevNet: A Deep Event Network for multimedia event detection and evidence recounting , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[201]  François Fleuret,et al.  FlowBoost — Appearance learning from sparsely annotated video , 2011, CVPR 2011.

[202]  Fei-Fei Li,et al.  Object-Centric Spatial Pooling for Image Classification , 2012, ECCV.

[203]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[204]  Cees Snoek,et al.  UvA-DARE ( Digital Academic Repository ) Event Fisher Vectors : Robust Encoding Visual Diversity of Visual Streams , 2015 .

[205]  Ronan Sicre,et al.  Discovering and Aligning Discriminative Mid-level Features for Image Classification , 2014, 2014 22nd International Conference on Pattern Recognition.

[206]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[207]  B. Yu,et al.  Boosting with the L_2-Loss: Regression and Classification , 2001 .

[208]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[209]  L. Cohen,et al.  Infant object segregation implies information integration. , 2001, Journal of experimental child psychology.

[210]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2012, International Journal of Computer Vision.