Convolutional Tensor-Train LSTM for Spatio-temporal Learning

Learning from spatio-temporal data has numerous applications, such as human-behavior analysis, object tracking, video compression, and physics simulation. However, existing methods still perform poorly on challenging video tasks such as long-term forecasting, because these tasks require learning long-term spatio-temporal correlations in the video sequence. In this paper, we propose a higher-order convolutional LSTM model that can efficiently learn these correlations, along with a succinct representation of the history. This is accomplished through a novel tensor-train module that performs prediction by combining convolutional features across time. To make this feasible in terms of computation and memory requirements, we propose a novel convolutional tensor-train decomposition of the higher-order model. This decomposition reduces the model complexity by jointly approximating a sequence of convolutional kernels as a low-rank tensor-train factorization. As a result, our model outperforms existing approaches while using only a fraction of the parameters of the baseline models. We achieve state-of-the-art performance on a wide range of applications and datasets, including multi-step video prediction on the Moving-MNIST-2 and KTH action datasets as well as early activity recognition on the Something-Something V2 dataset.
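To make the idea of combining convolutional features across time through a low-rank factorization concrete, here is a minimal sketch of a convolutional tensor-train style contraction over a window of past hidden states. The function name, shapes, rank, and the exact contraction order are illustrative assumptions for exposition, not the paper's reference implementation.

```python
# A minimal sketch (assumed names/shapes): replace one large higher-order kernel
# over a window of past hidden states with a chain of small per-step factors.
import torch
import torch.nn.functional as F

def conv_tt_contract(hidden_states, factors):
    """Combine a window of past hidden states with a chain of small conv factors.

    hidden_states: list of n tensors, each [B, R, H, W] (low-rank channel dim R).
    factors:       list of n conv kernels, each [R, R, k, k], one per time step.

    A full higher-order kernel over the concatenated window would need
    O(n * C^2 * k^2) parameters for channel width C; the chained factors keep
    the count at O(n * R^2 * k^2) with R << C.
    """
    k = factors[0].shape[-1]
    pad = k // 2  # 'same' padding so the spatial size is preserved
    carry = torch.zeros_like(hidden_states[0])
    for h_t, w_t in zip(hidden_states, factors):
        # fold the previous partial result into the current step, then convolve
        carry = F.conv2d(h_t + carry, w_t, padding=pad)
    return carry  # [B, R, H, W]; this would feed the LSTM gates in place of a plain conv

# usage: a 3-step window of rank-8 feature maps on a 16x16 grid
hs = [torch.randn(2, 8, 16, 16) for _ in range(3)]
ws = [torch.randn(8, 8, 3, 3) * 0.1 for _ in range(3)]
out = conv_tt_contract(hs, ws)
print(out.shape)  # torch.Size([2, 8, 16, 16])
```

The sketch only illustrates the parameter-sharing pattern: each time step contributes a small factor kernel, and the partial result is carried along the chain instead of materializing one large kernel over the whole window.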
