Correlation Net: Spatiotemporal multimodal deep learning for action recognition

Abstract This paper describes a network that captures multimodal correlations over arbitrary timestamps. The proposed scheme operates as a complementary, extended network over a multimodal convolutional neural network (CNN). Spatial and temporal streams are required for action recognition by a deep CNN, but overfitting reduction and fusing these two streams remain open problems. The existing fusion approach averages the two streams. Here we propose a correlation network with a Shannon fusion for learning a pre-trained CNN. A Long-range video may consist of spatiotemporal correlations over arbitrary times, which can be captured by forming the correlation network from simple fully connected layers. This approach was found to complement the existing network fusion methods. The importance of multimodal correlation is validated in comparison experiments on the UCF-101 and HMDB-51 datasets. The multimodal correlation enhanced the accuracy of the video recognition results.

[1]  Pichao Wang,et al.  Scene Flow to Action Map: A New Representation for RGB-D Based Action Recognition with Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Jürgen Schmidhuber,et al.  Learning to forget: continual prediction with LSTM , 1999 .

[3]  Kenta Oono,et al.  Chainer : a Next-Generation Open Source Framework for Deep Learning , 2015 .

[4]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[6]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[8]  Xin Xu,et al.  Multimodal Gesture Recognition Based on the ResC3D Network , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[9]  Shun-ichi Amari,et al.  Correlation and Independence in the Neural Code , 2006, Neural Computation.

[10]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[11]  Takio Kurita,et al.  Action Recognition Using Three-Way Cross-Correlations Feature of Local Moton Attributes , 2010, 2010 20th International Conference on Pattern Recognition.

[12]  Xu Chen,et al.  Joint Representation Learning for Top-N Recommendation with Heterogeneous Information Sources , 2017, CIKM.

[13]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[14]  Ghassan Al-Regib,et al.  TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition , 2017, Signal Process. Image Commun..

[15]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[16]  Limin Wang,et al.  Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice , 2014, Comput. Vis. Image Underst..

[17]  Takio Kurita,et al.  Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning , 2017, EURASIP J. Image Video Process..

[18]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Zhe Wang,et al.  Towards Good Practices for Very Deep Two-Stream ConvNets , 2015, ArXiv.

[21]  Carme Torras,et al.  Action Recognition Based on Efficient Deep Feature Learning in the Spatio-Temporal Domain , 2016, IEEE Robotics and Automation Letters.

[22]  James V. Stone Information Theory: A Tutorial Introduction , 2015, ArXiv.

[23]  Alexander C. Berg,et al.  Combining multiple sources of knowledge in deep CNNs for action recognition , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[24]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[26]  Tat-Seng Chua,et al.  Neural Collaborative Filtering , 2017, WWW.

[27]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Limin Wang,et al.  MoFAP: A Multi-level Representation for Action Recognition , 2015, International Journal of Computer Vision.

[29]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[30]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[31]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[33]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[35]  Henryk Sienkiewicz,et al.  Quo Vadis? , 1967, American Association of Industrial Nurses journal.

[36]  Xiaoyu Du,et al.  Outer Product-based Neural Collaborative Filtering , 2018, IJCAI.