Convolutional Tensor-Train LSTM for Spatio-temporal Learning

Learning from spatio-temporal data has numerous applications, such as human-behavior analysis, object tracking, video compression, and physics simulation. However, existing methods still perform poorly on challenging video tasks such as long-term forecasting, because these tasks require learning long-term spatio-temporal correlations in the video sequence. In this paper, we propose a higher-order convolutional LSTM model that can efficiently learn these correlations, along with a succinct representation of the history. This is accomplished through a novel tensor-train module that performs prediction by combining convolutional features across time. To make this feasible in terms of computation and memory requirements, we propose a novel convolutional tensor-train decomposition of the higher-order model. This decomposition reduces the model complexity by jointly approximating a sequence of convolutional kernels as a low-rank tensor-train factorization. As a result, our model outperforms existing approaches while using only a fraction of the parameters of the baseline models. We achieve state-of-the-art performance on a wide range of applications and datasets, including multi-step video prediction on the Moving-MNIST-2 and KTH action datasets as well as early activity recognition on the Something-Something V2 dataset.
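To make the idea of combining convolutional features across time through a low-rank factorization concrete, here is a minimal sketch of a convolutional tensor-train style contraction over a window of past hidden states. The function name, shapes, rank, and the exact contraction order are illustrative assumptions for exposition, not the paper's reference implementation.

```python
# A minimal sketch (assumed names/shapes): replace one large higher-order kernel
# over a window of past hidden states with a chain of small per-step factors.
import torch
import torch.nn.functional as F

def conv_tt_contract(hidden_states, factors):
    """Combine a window of past hidden states with a chain of small conv factors.

    hidden_states: list of n tensors, each [B, R, H, W] (low-rank channel dim R).
    factors:       list of n conv kernels, each [R, R, k, k], one per time step.

    A full higher-order kernel over the concatenated window would need
    O(n * C^2 * k^2) parameters for channel width C; the chained factors keep
    the count at O(n * R^2 * k^2) with R << C.
    """
    k = factors[0].shape[-1]
    pad = k // 2  # 'same' padding so the spatial size is preserved
    carry = torch.zeros_like(hidden_states[0])
    for h_t, w_t in zip(hidden_states, factors):
        # fold the previous partial result into the current step, then convolve
        carry = F.conv2d(h_t + carry, w_t, padding=pad)
    return carry  # [B, R, H, W]; this would feed the LSTM gates in place of a plain conv

# usage: a 3-step window of rank-8 feature maps on a 16x16 grid
hs = [torch.randn(2, 8, 16, 16) for _ in range(3)]
ws = [torch.randn(8, 8, 3, 3) * 0.1 for _ in range(3)]
out = conv_tt_contract(hs, ws)
print(out.shape)  # torch.Size([2, 8, 16, 16])
```

The sketch only illustrates the parameter-sharing pattern: each time step contributes a small factor kernel, and the partial result is carried along the chain instead of materializing one large kernel over the whole window.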
