Fast Retinomorphic Event-Driven Representations for Video Gameplay and Action Recognition

Good temporal representations are crucial for video understanding, and the state-of-the-art video recognition framework is based on two-stream networks. In such a framework, besides the regular ConvNets that handle RGB frame inputs, a second network is introduced to handle the temporal representation, usually optical flow (OF). However, OF and other task-oriented flows are computationally costly and are thus typically pre-computed. Critically, this prevents the two-stream approach from being applied to reinforcement learning (RL) applications such as video game playing, where the next state depends on the current state and the chosen actions. Inspired by the early vision systems of mammals and insects, we propose a fast event-driven representation (EDR) that models several major properties of early retinal circuits: (1) logarithmic input response, (2) multi-timescale temporal smoothing to filter noise, and (3) bipolar (ON/OFF) pathways for primitive event detection. Trading directional information for speed ($>$9000 fps), EDR enables real-time inference and learning in video applications that require interaction between an agent and the world, such as game playing, virtual robotics, and domain adaptation. In this vein, we use EDR to demonstrate performance improvements over state-of-the-art reinforcement learning algorithms for Atari games, something that has not been possible with pre-computed OF. Moreover, in UCF-101 video action recognition experiments, we show that EDR performs near the state of the art in accuracy while achieving a 1,500x speedup in input representation processing compared to optical flow.
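The three retinal properties listed above can be illustrated with a minimal sketch. It is not the implementation from the paper: the use of exponential moving averages for the temporal smoothing, the fast-minus-slow difference, and the names `edr_step`, `alphas`, `threshold`, and `eps` are all assumptions made for illustration only.

```python
import numpy as np


def edr_step(frame, state=None, alphas=(0.5, 0.1), threshold=0.1, eps=1.0):
    """One update of a hypothetical event-driven representation.

    frame:     2-D array of non-negative pixel intensities.
    state:     list of per-timescale smoothed log-intensity maps, or None
               on the first frame.
    alphas:    assumed smoothing rates, ordered fast to slow.
    threshold: assumed event threshold on the log-intensity change.
    Returns (on_map, off_map, new_state).
    """
    # (1) Logarithmic input response; eps avoids log(0).
    log_i = np.log(np.asarray(frame, dtype=np.float64) + eps)

    # Initialise every timescale to the first frame so a static scene
    # produces no events.
    if state is None:
        state = [log_i.copy() for _ in alphas]

    # (2) Multi-timescale temporal smoothing: one exponential moving
    # average per rate (an assumed choice of low-pass filter).
    new_state = [(1.0 - a) * s + a * log_i for a, s in zip(alphas, state)]

    # Fast minus slow average acts as a temporal band-pass signal.
    diff = new_state[0] - new_state[-1]

    # (3) Bipolar pathways: ON fires on brightening, OFF on dimming.
    on_map = (diff > threshold).astype(np.float32)
    off_map = (diff < -threshold).astype(np.float32)
    return on_map, off_map, new_state


if __name__ == "__main__":
    frame0 = np.full((4, 4), 10.0)
    frame1 = frame0.copy()
    frame1[1, 1] = 200.0  # one pixel brightens

    _, _, state = edr_step(frame0)
    on_map, off_map, state = edr_step(frame1, state)
    print("ON events:", int(on_map.sum()), "OFF events:", int(off_map.sum()))
```

Every operation in this sketch is element-wise, with no per-pixel correspondence search of the kind optical flow requires, which is consistent with the speed argument made in the abstract. The two output maps carry no directional information, matching the stated trade-off.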
