Pooling Transformer for Detection of Risk Events in In-The-Wild Video Ego Data

This paper proposes a video transformer architecture for detecting risk events affecting frail adults from egocentric (ego) video monitoring data. We first introduce an extended taxonomy of risk events and then propose a transformer-based video recognition model to detect them. The proposed architecture uses separable attention over the spatial and temporal dimensions. We also introduce a pooling operation on the temporal video data, weighted by their learned importance. Experiments were conducted on the visual data of the BIRDS dataset, which was recorded in the wild, and on Kinetics-400 for benchmarking. Using the pooling operation in the transformer yields a 3% improvement on the BIRDS dataset.
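The abstract does not specify how the importance-based temporal pooling is computed. As a rough illustration only, the sketch below shows one common way such pooling can be done: each frame-level feature vector gets a scalar score, the scores are normalized with a softmax, and the frames are averaged with those weights. The function name `importance_pool`, the scoring vector `w`, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def importance_pool(frame_tokens, w):
    """Pool T frame-level feature vectors into a single vector.

    frame_tokens: list of T feature vectors, each a list of D floats.
    w: hypothetical scoring vector of length D. In a real model it would
       be learned; here it is fixed by hand (an assumption).
    Returns (pooled_vector, per_frame_weights).
    """
    # One scalar importance score per frame: dot product with w.
    scores = [sum(x * wi for x, wi in zip(tok, w)) for tok in frame_tokens]
    # Normalize the scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    # Weighted average of the frame features along the temporal axis.
    pooled = [
        sum(weights[t] * frame_tokens[t][d] for t in range(len(frame_tokens)))
        for d in range(len(w))
    ]
    return pooled, weights


# Toy input: three frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With a zero scoring vector every frame is equally important,
# so the pooling reduces to a plain temporal average.
uniform_pooled, uniform_weights = importance_pool(frames, [0.0, 0.0])

# A scoring vector that favors the first feature puts most of the
# weight on the frames where that feature is active (frames 0 and 2).
biased_pooled, biased_weights = importance_pool(frames, [10.0, 0.0])
```

With a zero scoring vector the operation falls back to ordinary average pooling, which makes the learned weights easy to compare against a uniform baseline.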
