Sequence-to-Sequence Modeling for Action Identification at High Temporal Resolution

Automatic action identification from video and kinematic data is an important machine learning problem with applications ranging from robotics to smart health. Most existing work focuses on identifying coarse actions such as running, climbing, or cutting a vegetable, which have relatively long durations. This is a significant limitation for applications that require identifying subtle motions at high temporal resolution. For example, in stroke recovery, quantifying rehabilitation dose requires differentiating motions with sub-second durations. Our goal is to bridge this gap. To this end, we introduce StrokeRehab, a large-scale, multimodal dataset that serves as a new action-recognition benchmark and includes subtle, short-duration actions labeled at high temporal resolution. These short-duration actions are called functional primitives and consist of reaches, transports, repositions, stabilizations, and idles. The dataset consists of high-quality inertial measurement unit (IMU) sensor data and video data from 41 stroke-impaired patients performing activities of daily living such as feeding and brushing teeth. We show that current state-of-the-art models based on segmentation produce noisy predictions when applied to these data, which often leads to overcounting of actions. To address this, we propose a novel approach for high-resolution action identification, inspired by speech-recognition techniques, which is based on a sequence-to-sequence model that directly predicts the sequence of actions. This approach outperforms current state-of-the-art methods on the StrokeRehab dataset, as well as on the standard benchmark datasets 50Salads, Breakfast, and JIGSAWS.
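The overcounting problem described above can be illustrated with a minimal sketch. It is not the paper's model or evaluation code: the frame labels, the frame counts, and the helper names `collapse` and `edit_distance` are assumptions made up for illustration. The sketch collapses frame-wise labels into a sequence of actions, as a segmentation-based pipeline would, and shows how a single misclassified frame inflates the number of counted actions.

```python
from itertools import groupby


def collapse(frames):
    """Merge runs of identical frame labels into one action per run."""
    return [label for label, _ in groupby(frames)]


def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]


# Hypothetical ground truth: three functional primitives over ten frames.
truth = ["reach"] * 4 + ["transport"] * 4 + ["idle"] * 2

# Hypothetical frame-wise prediction with a single wrong frame.
noisy = list(truth)
noisy[2] = "idle"

true_actions = collapse(truth)
noisy_actions = collapse(noisy)

print(true_actions)
print(noisy_actions)
print(edit_distance(noisy_actions, true_actions))
```

In this toy case, one wrong frame out of ten turns three actions into five. A model that outputs the action sequence directly has no frame-level collapse step in which such spurious segments can arise, which is the motivation the abstract gives for the sequence-to-sequence approach.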
