Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-object interactions has historically been addressed through a single modality, and only a few works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on both audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration outperforms unimodal approaches; in particular, we achieve a 5.18% improvement over the state of the art on verb classification.
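
To make the described pipeline concrete, below is a minimal illustrative sketch of a three-stream architecture with TSN-style sparse temporal sampling and weighted late fusion, as outlined in the abstract. All names here (ThreeStreamLateFusion, the sub-networks, the fusion weights) are hypothetical placeholders, not the paper's actual implementation: each stream scores K sparsely sampled segments independently, the per-segment class scores are averaged, and the resulting per-stream scores are fused with a weighted sum.

import torch
import torch.nn as nn

class ThreeStreamLateFusion(nn.Module):
    # Hypothetical sketch: spatial (RGB), temporal (optical flow), and
    # audio (spectrogram) streams, fused at the score level.
    def __init__(self, spatial_net, temporal_net, audio_net,
                 weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.spatial_net = spatial_net    # e.g. a CNN over RGB frames
        self.temporal_net = temporal_net  # e.g. a CNN over stacked flow
        self.audio_net = audio_net        # e.g. a CNN over spectrograms
        self.weights = weights            # per-stream fusion weights (assumed)

    @staticmethod
    def segment_consensus(net, x):
        # x: (batch, K segments, ...). Score each sparsely sampled segment
        # independently, then average the class scores over segments.
        b, k = x.shape[:2]
        scores = net(x.flatten(0, 1))         # (b*k, num_classes)
        return scores.view(b, k, -1).mean(1)  # (b, num_classes)

    def forward(self, rgb, flow, spec):
        s = self.segment_consensus(self.spatial_net, rgb)
        t = self.segment_consensus(self.temporal_net, flow)
        a = self.segment_consensus(self.audio_net, spec)
        ws, wt, wa = self.weights
        # Late fusion: weighted sum of per-stream class scores.
        return ws * s + wt * t + wa * a

Because fusion happens only at the score level, each stream can be trained independently on its own modality before the fused scores are combined at inference time.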
