Paper information - Pooling Transformer for Detection of Risk Events in In-The-Wild Video Ego Data

This paper proposes a video transformer architecture for detecting risk events affecting frail adults from egocentric video monitoring data. We first introduce an extended taxonomy of risk events, and then propose a transformer-based video recognition model to detect them. The proposed architecture uses separable attention over the spatial and temporal dimensions. We also introduce a pooling operation over the temporal video data that is based on its learned importance. Experiments were conducted on the visual data of the BIRDS dataset, which was recorded in the wild, and on Kinetics-400 for benchmarking. Adding the pooling operation to the transformer yields an improvement of 3% on the BIRDS dataset.
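The abstract does not give the formulation of the pooling operation, so the following is only a minimal sketch of one common way to pool temporal features by learned importance: each frame feature receives a scalar score, a softmax over time turns the scores into weights, and the output is the weighted sum. The names `importance_pool` and `w` are hypothetical, the scoring vector `w` stands in for a parameter that would be learned during training, and none of this should be read as the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def importance_pool(frames: Sequence[Sequence[float]],
                    w: Sequence[float]) -> List[float]:
    """Pool T frame features (each of dimension D) into a single D-vector.

    Assumption: the importance of a frame is the dot product <w, frame>,
    normalized over time with a softmax. `w` is a stand-in for a learned
    parameter; here it is simply passed in.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]


# Two frames with 2-dimensional features.
frames = [[1.0, 2.0], [3.0, 4.0]]

# With a zero scoring vector all frames are equally important,
# so the result reduces to plain average pooling over time.
uniform = importance_pool(frames, [0.0, 0.0])

# A scoring vector that strongly favours the second frame
# makes the pooled feature nearly equal to that frame.
peaked = importance_pool([[0.0, 5.0], [1.0, 7.0]], [100.0, 0.0])
```

With a zero scoring vector the sketch falls back to average pooling, which makes it easy to compare against a baseline without learned importance.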
