论文信息 - Layout-Induced Video Representation for Recognizing Agent-in-Place Actions

Layout-Induced Video Representation for Recognizing Agent-in-Place Actions

We address scene layout modeling for recognizing agent-in-place actions, which are actions associated with \textit{agents} who perform them and the \textit{places} where they occur, in the context of outdoor home surveillance. We introduce a novel representation to model the geometry and topology of scene layouts so that a network can generalize from the layouts observed in the training scenes to unseen scenes in the test set. This Layout-Induced Video Representation (LIVR) abstracts away low-level appearance variance and encodes geometric and topological relationships of places to explicitly model scene layout. LIVR partitions the semantic features of a scene into different places to force the network to learn generic place-based feature descriptions which are independent of specific scene layouts; then, LIVR dynamically aggregates features based on connectivities of places in each specific scene to model its layout. We introduce a new Agent-in-Place Action (APA) dataset\footnote{The dataset is pending legal review and will be released upon the acceptance of this paper.} to show that our method allows neural network models to generalize significantly better to unseen scenes.

[1] Takeo Kanade,et al. A System for Video Surveillance and Monitoring , 2000 .

[2] Cordelia Schmid,et al. Actor-Centric Relation Network , 2018, ECCV.

[3] Yihong Gong,et al. Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[4] Rahul Sukthankar,et al. Rethinking the Faster R-CNN Architecture for Temporal Action Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5] Iasonas Kokkinos,et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Svetlana Lazebnik,et al. Superparsing , 2010, International Journal of Computer Vision.

[7] Limin Wang,et al. Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[8] Jitendra Malik,et al. Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[9] Fabio Tozeto Ramos,et al. Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[10] Martial Hebert,et al. Activity Forecasting , 2012, ECCV.

[11] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12] Alexei A. Efros,et al. Scene completion using millions of photographs , 2008, Commun. ACM.

[13] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Larry S. Davis,et al. C-WSL: Count-guided Weakly Supervised Localization , 2017, ECCV.

[15] Jason J. Corso,et al. Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Cordelia Schmid,et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17] Yu-Gang Jiang,et al. Learning Semantic Feature Map for Visual Content Recognition , 2017, ACM Multimedia.

[18] Yuan-Fang Wang,et al. Dynamic Temporal Pyramid Network: A Closer Look at Multi-Scale Modeling for Activity Detection , 2018, ACCV.

[19] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[20] Song-Chun Zhu,et al. Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.

[21] Antonio Torralba,et al. Nonparametric Scene Parsing via Label Transfer , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22] Thomas Serre,et al. HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[23] Larry S. Davis,et al. The Role of Context Selection in Object Detection , 2016, BMVC.

[24] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[25] Vittorio Ferrari,et al. COCO-Stuff: Thing and Stuff Classes in Context , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26] Shaogang Gong,et al. Discovery of Shared Semantic Spaces for Multiscene Video Query and Summarization , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[27] Alexei A. Efros,et al. Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[28] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.

[29] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30] G LoweDavid,et al. Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[31] Larry S. Davis,et al. Selective Encoding for Recognizing Unreliably Localized Faces , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[32] Shih-Fu Chang,et al. CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Hao Su,et al. Object Bank: An Object-Level Image Representation for High-Level Visual Recognition , 2014, International Journal of Computer Vision.

[35] Xu Zhao,et al. Single Shot Temporal Action Detection , 2017, ACM Multimedia.

[36] Catherine Achard,et al. The DAily Home LIfe Activity Dataset: A High Semantic Activity Dataset for Online Recognition , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[37] Kate Saenko,et al. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[39] D. Burns,et al. End To End , 2015 .

[40] Jonghyun Choi,et al. Learning Temporal Regularity in Video Sequences , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Larry S. Davis,et al. Generating Holistic 3D Scene Abstractions for Text-Based Image Retrieval , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Trevor Darrell,et al. The pyramid match kernel: discriminative classification with sets of image features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[43] Yihong Gong,et al. Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[44] Chenliang Xu,et al. Actor-Action Semantic Segmentation with Grouping Process Models , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Wei Shen,et al. Spatial-temporal convolutional neural networks for anomaly detection and localization in crowded scenes , 2016, Signal Process. Image Commun..

[46] Larry S. Davis,et al. W4: Real-Time Surveillance of People and Their Activities , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[47] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[48] Bernard Ghanem,et al. End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos , 2017, BMVC.

[49] Ghassan Al-Regib,et al. TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition , 2017, Signal Process. Image Commun..

[50] Nicu Sebe,et al. Learning Deep Representations of Appearance and Motion for Anomalous Event Detection , 2015, BMVC.

[51] Andrew Zisserman,et al. Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[52] Yann LeCun,et al. Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[53] Jian Sun,et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54] Larry S. Davis,et al. ReMotENet: Efficient Relevant Motion Event Detection for Large-Scale Home Surveillance Videos , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[55] Antonio Torralba,et al. A Data-Driven Approach for Event Prediction , 2010, ECCV.

[56] Haroon Idrees,et al. The THUMOS challenge on action recognition for videos "in the wild" , 2016, Comput. Vis. Image Underst..

[57] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[59] Asim Kadav,et al. Attend and Interact: Higher-Order Object Interactions for Video Understanding , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61] Luc Van Gool,et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[62] Trevor Darrell,et al. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[63] Koen E. A. van de Sande,et al. Segmentation as selective search for object recognition , 2011, 2011 International Conference on Computer Vision.

[64] Cordelia Schmid,et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[65] Thomas Mensink,et al. Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[66] Larry S. Davis,et al. Dynamic Zoom-in Network for Fast Object Detection in Large Images , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[67] Matthew J. Hausknecht,et al. Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68] Larry S. Davis,et al. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[69] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[70] Bernt Schiele,et al. Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data , 2015, International Journal of Computer Vision.

[71] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[72] Ming Yang,et al. 3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[73] François Brémond,et al. Video understanding for complex activity recognition , 2006, Machine Vision and Applications.

[74] Alberto Del Bimbo,et al. A data-driven approach for tag refinement and localization in web videos , 2015, Comput. Vis. Image Underst..

[75] Silvio Savarese,et al. Knowledge Transfer for Scene-Specific Motion Prediction , 2016, ECCV.