Forecasting Future Action Sequences with Neural Memory Networks

We propose a novel neural memory network based framework for future action sequence forecasting. This is a challenging task where we have to consider short-term, within sequence relationships as well as relationships in between sequences, to understand how sequences of actions evolve over time. To capture these relationships effectively, we introduce neural memory networks to our modelling scheme. We show the significance of using two input streams, the observed frames and the corresponding action labels, which provide different information cues for our prediction task. Furthermore, through the proposed method we effectively map the long-term relationships among individual input sequences through separate memory modules, which enables better fusion of the salient features. Our method outperforms the state-of-the-art approaches by a large margin on two publicly available datasets: Breakfast and 50 Salads.

[1]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[2]  Thomas Serre,et al.  The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Lars Petersson,et al.  Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Richard Socher,et al.  Ask Me Anything: Dynamic Memory Networks for Natural Language Processing , 2015, ICML.

[5]  Sankar K. Pal,et al.  Multilayer perceptron, fuzzy sets, and classification , 1992, IEEE Trans. Neural Networks.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Stan Sclaroff,et al.  Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Sridha Sridharan,et al.  Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[9]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Stephen J. McKenna,et al.  Combining embedded accelerometers with computer vision for recognizing food preparation activities , 2013, UbiComp.

[11]  Sridha Sridharan,et al.  Tree Memory Networks for Modelling Long-term Temporal Dependencies , 2017, Neurocomputing.

[12]  Sridha Sridharan,et al.  Multi-Level Sequence GAN for Group Activity Recognition , 2018, ACCV.

[13]  Sridha Sridharan,et al.  Coupled Generative Adversarial Network for Continuous Fine-Grained Action Segmentation , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[14]  Hong Yu,et al.  Neural Semantic Encoders , 2016, EACL.

[15]  Ramakant Nevatia,et al.  RED: Reinforced Encoder-Decoder Networks for Action Anticipation , 2017, BMVC.

[16]  Amit K. Roy-Chowdhury,et al.  Joint Prediction of Activity Labels and Starting Times in Untrimmed Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Chen Chen,et al.  Memory Attention Networks for Skeleton-Based Action Recognition , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[18]  Yazan Abu Farha,et al.  When will you do what? - Anticipating Temporal Occurrences of Activities , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[20]  Gregory D. Hager,et al.  Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation , 2016, ECCV.

[21]  Fiora Pirri,et al.  Anticipation and next action forecasting in video: an end-to-end model with memory , 2019, ArXiv.