Improved Action Recognition with Separable Spatio-Temporal Attention Using Alternative Skeletal and Video Pre-Processing

The potential benefits of recognising activities of daily living from video for active and assisted living have yet to be fully untapped. These technologies can be used for behaviour understanding, and lifelogging for caregivers and end users alike. The recent publication of realistic datasets for this purpose, such as the Toyota Smarthomes dataset, calls for pushing forward the efforts to improve action recognition. Using the separable spatio-temporal attention network proposed in the literature, this paper introduces a view-invariant normalisation of skeletal pose data and full activity crops for RGB data, which improve the baseline results by 9.5% (on the cross-subject experiments), outperforming state-of-the-art techniques in this field when using the original unmodified skeletal data in dataset. Our code and data are available online.

[1]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Bingbing Ni,et al.  RGBD-HuDaAct: A color-depth video database for human daily activity recognition , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[3]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[4]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[5]  Michael Firman,et al.  RGBD Datasets: Past, Present and Future , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[6]  Cordelia Schmid,et al.  LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Behrooz Mahasseni,et al.  Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Ying Wu,et al.  Cross-View Action Modeling, Learning, and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  J. Olsen,et al.  The European Commission , 2020, The European Union.

[12]  D. Holdstock Past, present--and future? , 2005, Medicine, conflict, and survival.

[13]  Michael S. Ryoo,et al.  AssembleNet++: Assembling Modality Representations via Attention Connections , 2020, ECCV.

[14]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Bart Selman,et al.  Unstructured human activity detection from RGBD images , 2011, 2012 IEEE International Conference on Robotics and Automation.

[16]  Michael S. Ryoo,et al.  AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures , 2019, ICLR.

[17]  B. Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[18]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Andrew Blake,et al.  Efficient Human Pose Estimation from Single Depth Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[21]  Nasser Kehtarnavaz,et al.  UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[22]  Р Ю Чуйков,et al.  Обнаружение транспортных средств на изображениях загородных шоссе на основе метода Single shot multibox Detector , 2017 .

[23]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Gang Wang,et al.  NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Lei Shi,et al.  Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Rui Dai,et al.  VPN: Learning Video-Pose Embedding for Activities of Daily Living , 2020, ECCV.

[27]  Alexandros André Chaaraoui,et al.  An Efficient Approach for Multi-view Human Action Recognition Based on Bag-of-Key-Poses , 2012, HBU.

[28]  Monique Thonnat,et al.  Where to Focus on for Human Action Recognition? , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[29]  Gianpiero Francesca,et al.  Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos , 2020, ArXiv.

[30]  Michal Koperski,et al.  Toyota Smarthome: Real-World Activities of Daily Living , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Paolo Sernani,et al.  Exploring the ambient assisted living domain: a systematic review , 2017, J. Ambient Intell. Humaniz. Comput..

[32]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[33]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Michal Koperski,et al.  Deep-Temporal LSTM for Daily Living Action Recognition , 2018, 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[35]  Alexandros André Chaaraoui,et al.  Evolutionary joint selection to improve human action recognition with RGB-D devices , 2014, Expert Syst. Appl..