VIDEOWHISPER: Towards Unsupervised Learning of Discriminative Video Features with RNNs

We present VideoWhisper, a novel approach to unsupervised video representation learning in which the video sequence itself serves as a self-supervision signal, based on the observation that the sequence encodes the video's temporal dynamics (e.g., object movement and event evolution). Specifically, for each video, we use a pre-learned visual dictionary to generate a sequence of high-level semantics, dubbed a "whisper", which encodes both visual content at the frame level and visual dynamics at the sequence level. VideoWhisper is driven by a novel "sequence-to-whisper" learning strategy: an end-to-end sequence-to-sequence RNN model is trained to predict the whisper sequence from the frame sequence. We propose two ways to derive video representations from the trained model. Through extensive experiments, we demonstrate that the video representations learned by VideoWhisper are effective in boosting fundamental video-related applications such as video retrieval and classification.
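
To make the "sequence-to-whisper" idea concrete, below is a minimal PyTorch sketch of an encoder-decoder RNN that reads per-frame features and is trained to predict a whisper sequence, i.e., token indices into a pre-learned visual dictionary. All names, dimensions, and the teacher-forced decoding loop here are illustrative assumptions for exposition, not the authors' implementation; the abstract does not specify the exact architecture or dictionary construction.

```python
# A minimal sketch of sequence-to-whisper learning, assuming per-frame CNN
# features as input and visual-dictionary indices as targets. Names such as
# WhisperSeq2Seq and all dimensions are hypothetical.
import torch
import torch.nn as nn


class WhisperSeq2Seq(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, dict_size=1000):
        super().__init__()
        # Encoder LSTM summarizes the frame-feature sequence.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder LSTM emits one dictionary token ("whisper" word) per step.
        self.embed = nn.Embedding(dict_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, dict_size)

    def forward(self, frame_feats, whisper_tokens):
        # frame_feats: (batch, n_frames, feat_dim)
        # whisper_tokens: (batch, n_steps) target dictionary indices
        _, (h, c) = self.encoder(frame_feats)
        # Teacher forcing (target shift omitted for brevity).
        dec_in = self.embed(whisper_tokens)
        dec_out, _ = self.decoder(dec_in, (h, c))
        logits = self.out(dec_out)  # (batch, n_steps, dict_size)
        # The final encoder state h (or pooled decoder outputs) can serve
        # as the learned video representation after training.
        return logits, h.squeeze(0)


# Training step: cross-entropy on the predicted whisper sequence.
model = WhisperSeq2Seq()
feats = torch.randn(4, 30, 2048)           # 4 videos, 30 frames each
tokens = torch.randint(0, 1000, (4, 10))   # 10-step whisper targets
logits, video_repr = model(feats, tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000),
                                   tokens.reshape(-1))
loss.backward()
```

In this sketch, the two natural candidates for a video representation are the final encoder hidden state and a pooling of the decoder outputs, which plausibly correspond to the abstract's "two ways to generate video representation"; that mapping is an assumption, not stated in the source.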
