Neuromorphic Vision Sensing for CNN-based Action Recognition

Neuromorphic vision sensing (NVS) hardware is now gaining traction as a low-power/high-speed visual sensing technology that circumvents the limitations of conventional active pixel sensing (APS) cameras. While object detection and tracking models have been investigated in conjunction with NVS, there is currently little work on NVS for higher-level semantic tasks, such as action recognition. Contrary to recent work that considers homogeneous transfer between flow domains (optical flow to motion vectors), we propose to embed an NVS emulator into a multi-modal transfer learning framework that carries out heterogeneous transfer from optical flow to NVS. The potential of our framework is showcased by the fact that, for the first time, our NVS-based results achieve comparable action recognition performance to motion-vector or optical-flow based methods (i.e., accuracy on UCF-101 within 8.8% of I3D with optical flow), with the NVS emulator and NVS camera hardware offering 3 to 6 orders of magnitude faster frame generation (respectively) compared to standard Brox optical flow. Beyond this significant advantage, our CNN processing is found to have the lowest total GFLOP count against all competing methods (up to 7.7 times complexity saving compared to I3D with optical flow).

[1]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[3]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[4]  Ryad Benosman,et al.  Asynchronous Event-Based Fourier Analysis , 2017, IEEE Transactions on Image Processing.

[5]  Tobi Delbrück,et al.  Steering a predator robot using a mixed frame/event-driven convolutional neural network , 2016, 2016 Second International Conference on Event-based Control, Communication, and Signal Processing (EBCCSP).

[6]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[7]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Tobi Delbruck Neuromorophic vision sensing and processing , 2016, 2016 46th European Solid-State Device Research Conference (ESSDERC).

[9]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[10]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Eugenio Culurciello,et al.  Activity-driven, event-based vision sensors , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.

[12]  Alexander J. Smola,et al.  Compressed Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Chiara Bartolozzi,et al.  Event-Based Visual Flow , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[14]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[15]  Patrick Camilleri,et al.  pyDVS: An extensible, real-time Dynamic Vision Sensor emulator using off-the-shelf hardware , 2016, 2016 IEEE Symposium Series on Computational Intelligence (SSCI).

[16]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Tobi Delbruck,et al.  A 240 × 180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision Sensor , 2014, IEEE Journal of Solid-State Circuits.

[18]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[19]  Gregory Cohen,et al.  Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades , 2015, Front. Neurosci..

[20]  Yiannis Andreopoulos,et al.  PIX2NVS: Parameterized conversion of pixel-domain video frames to neuromorphic vision streams , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[21]  Thomas Brox,et al.  High Accuracy Optical Flow Estimation Based on a Theory for Warping , 2004, ECCV.

[22]  Giacomo Indiveri,et al.  Neuromorphic Engineering , 2015, Handbook of Computational Intelligence.

[23]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[24]  Hongjie Liu,et al.  DVS Benchmark Datasets for Object Tracking, Action Recognition, and Object Recognition , 2016, Front. Neurosci..

[25]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).