An Egocentric Action Anticipation Framework via Fusing Intuition and Analysis

In this paper, we focus on egocentric action anticipation from videos, which enables various applications, such as helping intelligent wearable assistants understand users' needs and enhance their capabilities during interaction. It requires an intelligent system to observe from the first-person perspective and predict an action before it occurs. Owing to the uncertainty of the future, relying on visual information alone is insufficient for action anticipation, especially when there is a salient visual difference between the past and the future. To alleviate this problem, which we call the visual gap in this paper, we propose a novel Intuition-Analysis Integrated (IAI) framework inspired by psychological research, which mainly consists of three parts: an Intuition-based Prediction Network (IPN), an Analysis-based Prediction Network (APN), and an Adaptive Fusion Network (AFN). To imitate the implicit intuitive thinking process, we model the IPN as an encoder-decoder structure and introduce a procedural instruction learning strategy implemented through textual pre-training. In contrast, the APN processes information under designed rules to imitate explicit analytical thinking, divided into three steps: recognition, transition, and combination. Both the procedural instruction learning strategy in the IPN and the transition step of the APN are crucial to improving anticipation performance by mitigating the visual gap problem. Considering the complementarity of intuition and analysis, the AFN adopts attention fusion to adaptively integrate the predictions of the IPN and APN into the final anticipation results. We conduct experiments on the largest egocentric video dataset. Qualitative and quantitative results validate the effectiveness of our IAI framework and demonstrate the advantage of bridging the visual gap by utilizing multi-modal information, including both visual features of observed segments and sequential instructions of actions.
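To make the adaptive fusion idea concrete, the minimal PyTorch sketch below shows one plausible way to attention-fuse class scores from an intuition-like branch and an analysis-like branch, in the spirit of the AFN described above. The module name `AttentionFusion`, the gating network, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: attention-based fusion of two prediction branches.
# All names and dimensions are hypothetical; this is not the authors' code.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Adaptively weights class scores from an intuition-like branch (IPN)
    and an analysis-like branch (APN) to produce the final prediction."""

    def __init__(self, num_classes: int, hidden_dim: int = 128):
        super().__init__()
        # Small gating network: maps the concatenated branch scores to two
        # attention weights (one per branch) that sum to one.
        self.gate = nn.Sequential(
            nn.Linear(2 * num_classes, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, ipn_scores: torch.Tensor, apn_scores: torch.Tensor) -> torch.Tensor:
        # ipn_scores, apn_scores: (batch, num_classes) class scores from each branch.
        weights = self.gate(torch.cat([ipn_scores, apn_scores], dim=-1))  # (batch, 2)
        # Weighted sum of the two branches, broadcast over the class dimension.
        fused = weights[:, 0:1] * ipn_scores + weights[:, 1:2] * apn_scores
        return fused


if __name__ == "__main__":
    batch, num_classes = 4, 10  # arbitrary placeholder sizes
    fusion = AttentionFusion(num_classes)
    ipn = torch.randn(batch, num_classes)  # stand-in for IPN outputs
    apn = torch.randn(batch, num_classes)  # stand-in for APN outputs
    print(fusion(ipn, apn).shape)          # torch.Size([4, 10])
```

The design choice illustrated here is that the fusion weights are predicted per sample rather than fixed, so the model can lean on one branch or the other depending on the input, which is the stated motivation for adaptive fusion of complementary intuition and analysis.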
