Efficient Two-stream Action Recognition on FPGA

Action recognition is an important research field with applications in surveillance, video search, autonomous vehicles, and more. However, current state-of-the-art action classifiers are still not widely adopted in embedded applications. The major reason is that action recognition must process both spatial and temporal streaming data to identify actions precisely, which is compute-intensive and power-hungry. To address this issue, researchers have started using FPGAs to run action recognition models at minimal power. In this paper, we propose a new hardware architecture for action recognition on FPGA. Our model is based on the popular two-stream neural network. By optimizing the optical flow and convolution operations in the temporal domain, our method achieves similar accuracy with one order of magnitude fewer operations than C3D baseline models. We have implemented our model on a Xilinx UltraScale+ ZCU102 and released the source code.
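As a rough illustration of the two-stream idea the model builds on, the sketch below fuses class scores from a spatial (RGB) stream and a temporal (optical-flow) stream by a weighted average of their softmax probabilities. This is a generic late-fusion scheme and not necessarily the fusion used in this work; the function names, the `temporal_weight` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Late fusion: weighted average of per-class probabilities.

    spatial_logits  -- scores from the RGB (appearance) stream
    temporal_logits -- scores from the optical-flow (motion) stream
    temporal_weight -- hypothetical mixing weight in [0, 1] (an assumption)
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= temporal_weight <= 1.0:
        raise ValueError("temporal_weight must lie in [0, 1]")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1.0 - w) * s + w * t for s, t in zip(p_spatial, p_temporal)]


def predict(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_two_stream(spatial_logits, temporal_logits, temporal_weight)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Made-up scores for three classes: the streams disagree on the top class.
    spatial = [2.0, 0.0, 0.0]
    temporal = [0.0, 0.0, 3.0]
    print(fuse_two_stream(spatial, temporal))
    print(predict(spatial, temporal))
```

Averaging probabilities rather than raw logits keeps the two streams on a comparable scale, which matters when one stream produces larger-magnitude scores than the other.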
