Paper Information - How Deep Features Have Improved Event Recognition in Multimedia

How Deep Features Have Improved Event Recognition in Multimedia

Event recognition is one of the areas in multimedia that is attracting great attention from researchers. Because it is applicable to a wide range of scenarios, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. At the same time, following its immense success in classification, object recognition, and detection, deep learning has been shown to perform well in event recognition tasks as well. Thus, a large portion of the literature on event analysis now relies on deep learning architectures. In this article, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia content can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. We extensively review different deep-learning-based frameworks for event recognition in these four domains. Furthermore, we review some benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we provide a detailed discussion of the basic insights gathered from the literature review, and identify future trends and challenges.

Nicola Conci | Kashif Ahmad
