Deep Learning for Video Classification and Captioning

Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification concentrates on automatically labeling video clips based on their semantic contents like human actions or complex events, video captioning attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos. In addition, we also provide a review of popular benchmarks and competitions, which are critical for evaluating the technical progress of this vibrant field.

[1]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[2]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[5]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Wei Xu,et al.  Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Bernt Schiele,et al.  Coherent Multi-sentence Video Description with Variable Level of Detail , 2014, GCPR.

[9]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[10]  Rama Chellappa,et al.  Machine Recognition of Human Activities: A Survey , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[11]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[12]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[13]  Shih-Fu Chang,et al.  Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Chenliang Xu,et al.  A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Bernt Schiele,et al.  Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.

[19]  Cees Snoek,et al.  VideoLSTM convolves, attends and flows for action recognition , 2016, Comput. Vis. Image Underst..

[20]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.

[22]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[23]  Xi Wang,et al.  Evaluating Two-Stream CNN for Video Classification , 2015, ICMR.

[24]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[25]  Bernt Schiele,et al.  The Long-Short Story of Movie Description , 2015, GCPR.

[26]  Zhongchao Shi,et al.  Boosting Video Description Generation by Explicitly Translating from Frame-Level Captions , 2016, ACM Multimedia.

[27]  Xi Wang,et al.  Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification , 2016, ACM Multimedia.

[28]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[29]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[30]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Tao Mei,et al.  Learning Deep Intrinsic Video Representation by Exploring Temporal Coherence and Graph Structure , 2016, IJCAI.

[32]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Chong-Wah Ngo,et al.  Click-through-based cross-view learning for image search , 2014, SIGIR.

[34]  Guo-Jun Qi,et al.  Differential Recurrent Neural Networks for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Haroon Idrees,et al.  The THUMOS challenge on action recognition for videos "in the wild" , 2016, Comput. Vis. Image Underst..

[36]  Yann LeCun,et al.  Regularization of Neural Networks using DropConnect , 2013, ICML.

[37]  Lin Sun,et al.  Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[39]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[40]  Chong-Wah Ngo,et al.  Towards textually describing complex video contents with audio-visual concept classifiers , 2011, ACM Multimedia.

[41]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Marcus Rohrbach,et al.  A Multi-scale Multiple Instance Video Description Network , 2015, ArXiv.

[43]  Yifan Peng,et al.  Studying Relationships between Human Gaze, Description, and Computer Vision , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Tao Mei,et al.  Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation , 2016, ICMR.

[46]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[47]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[48]  Nitish Srivastava,et al.  Exploiting Image-trained CNN Architectures for Unconstrained Video Classification , 2015, BMVC.

[49]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[50]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Bernt Schiele,et al.  A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[53]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[54]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[55]  Bernt Schiele,et al.  Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data , 2015, International Journal of Computer Vision.

[56]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[57]  Kate Saenko,et al.  Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text , 2016, EMNLP.

[58]  W. Pitts,et al.  A Logical Calculus of the Ideas Immanent in Nervous Activity (1943) , 2021, Ideas That Created the Future.

[59]  Ting Liu,et al.  Recent advances in convolutional neural networks , 2015, Pattern Recognit..

[60]  Wei Xie,et al.  Security and Privacy in Cloud Computing: A Survey , 2010, 2010 Sixth International Conference on Semantics, Knowledge and Grids.

[61]  Chong-Wah Ngo,et al.  Annotation for free: video tagging by mining user search behavior , 2013, ACM Multimedia.

[62]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[63]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[65]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[66]  Christopher Joseph Pal,et al.  Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research , 2015, ArXiv.

[67]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[69]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[70]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[71]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[72]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[73]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[74]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[75]  William B. Dolan,et al.  Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.

[76]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[77]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[78]  Xinlei Chen,et al.  Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.

[79]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[80]  Eduardo Sontag,et al.  Turing computability with neural nets , 1991 .

[81]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[82]  Chong-Wah Ngo,et al.  Semi-supervised Domain Adaptation with Subspace Learning for visual recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Kunio Fukunaga,et al.  Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions , 2002, International Journal of Computer Vision.

[84]  Jiebo Luo,et al.  Kodak consumer video benchmark data set : concept definition and annotation * * , 2008 .

[85]  Yoshua Bengio,et al.  Maxout Networks , 2013, ICML.

[86]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[87]  Kate Saenko,et al.  Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild , 2014, COLING.

[88]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[89]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[90]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[91]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[92]  Chong-Wah Ngo,et al.  Learning Query and Image Similarities with Ranking Canonical Correlation Analysis , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[93]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[94]  Tao Mei,et al.  Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[95]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[96]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[97]  Mark Sanderson,et al.  Automatic video tagging using content redundancy , 2009, SIGIR.

[98]  Chong-Wah Ngo,et al.  Predicting domain adaptivity: redo or recycle? , 2012, ACM Multimedia.

[99]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[100]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[101]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[102]  D. Hubel,et al.  Receptive fields and functional architecture of monkey striate cortex , 1968, The Journal of physiology.

[103]  Christopher Joseph Pal,et al.  Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[104]  Ali Farhadi,et al.  Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[105]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[106]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[107]  Trevor Darrell,et al.  YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[108]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[109]  Kwong-Sak Leung,et al.  A Survey of Crowdsourcing Systems , 2011, 2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing.

[110]  Lorenzo Torresani,et al.  C3D: Generic Features for Video Analysis , 2014, ArXiv.

[111]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[112]  Elisa Ricci,et al.  Space speaks: towards socially and personality aware visual surveillance , 2010, MPVA '10.

[113]  Kunihiko Fukushima,et al.  Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position , 1980, Biological Cybernetics.

[114]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).