Temporal Attention Neural Network for Video Understanding

Deep-learning-based visual understanding algorithms have recently approached human-level performance in object recognition and image captioning. These evaluations are, however, limited to static data, and the algorithms share that limitation. In particular, they cannot selectively encode human behavior, the movement of multiple objects, or time-varying changes in the background. To address these limitations and extend such algorithms to dynamic video, we propose a temporal attention CNN-RNN network with a motion saliency map. Our model overcomes the scarcity of usable information in the encoded data and efficiently integrates motion features by exploiting the dynamics present in successive frames. We evaluate the model on the public UCF101 dataset, and our experiments demonstrate that it successfully extracts motion information for video understanding without any computationally intensive preprocessing.
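The two ingredients named above can be illustrated with a minimal sketch: a motion saliency map built from successive-frame differences, and softmax temporal attention that pools per-frame CNN feature vectors into a single video descriptor. This is a simplified illustration under our own assumptions, not the paper's actual architecture; the function names, the frame-difference saliency, and the mean-context attention score are all hypothetical choices, and a real model would use learned CNN features and trained attention weights inside an RNN.

```python
import numpy as np

def motion_saliency(frames):
    """Cheap motion saliency: absolute differences between successive frames.
    frames: array of shape (T, H, W). Returns an array of the same shape
    (the first frame gets an all-zero map, since it has no predecessor).
    Assumption: the paper's saliency map is likewise driven by inter-frame
    dynamics; the exact formulation here is illustrative."""
    diffs = np.abs(np.diff(frames, axis=0))                    # (T-1, H, W)
    return np.concatenate([np.zeros_like(frames[:1]), diffs])  # (T, H, W)

def temporal_attention(features):
    """Softmax attention over per-frame feature vectors.
    features: array of shape (T, D), e.g. CNN activations per frame.
    Scores each frame by similarity to the mean temporal context
    (a stand-in for a learned attention network), then returns the
    attention-weighted pooled descriptor (D,) and the weights (T,)."""
    scores = features @ features.mean(axis=0)   # (T,) relevance scores
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ features, weights
```

A typical use would compute `motion_saliency` on raw frames to modulate the CNN input, then apply `temporal_attention` to the resulting per-frame features before classification; the weights sum to one, so frames carrying more motion-relevant features dominate the pooled descriptor.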
