Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning

Good temporal representations are crucial for video understanding, and the state-of-the-art video recognition framework is based on two-stream networks. In such a framework, besides the regular ConvNets responsible for RGB frame inputs, a second network is introduced to handle the temporal representation, usually optical flow (OF). However, OF and other task-oriented flows are computationally costly and are thus typically pre-computed. Critically, this prevents the two-stream approach from being applied to reinforcement learning (RL) applications such as video game playing, where the next state depends on the current state and action choices. Inspired by the early vision systems of mammals and insects, we propose a fast event-driven representation (EDR) that models several major properties of early retinal circuits: (1) logarithmic input response, (2) multi-timescale temporal smoothing to filter noise, and (3) bipolar (ON/OFF) pathways for primitive event detection [12]. Trading off directional information for speed (> 9000 fps), EDR enables fast real-time inference and learning in video applications that require interaction between an agent and the world, such as game playing, virtual robotics, and domain adaptation. In this vein, we use EDR to demonstrate performance improvements over state-of-the-art reinforcement learning algorithms for Atari games, something that has not been possible with pre-computed OF. Moreover, with UCF-101 video action recognition experiments, we show that EDR performs near state-of-the-art in accuracy while achieving a 1,500x speedup in input representation processing compared to optical flow.
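The three retinal properties listed above can be illustrated with a minimal sketch. The abstract gives no equations, so everything here is an assumption made for illustration: the function names `init_state` and `edr_step`, the use of two exponential moving averages as the "multi-timescale smoothing", the smoothing factors, and the threshold are all hypothetical and are not taken from the paper. Frames are flat lists of non-negative pixel intensities.

```python
import math


def init_state(frame):
    """Start both temporal filters at the log response of the first frame."""
    logs = [math.log1p(p) for p in frame]
    return {"fast": list(logs), "slow": list(logs)}


def edr_step(frame, state, alpha_fast=0.5, alpha_slow=0.1, threshold=0.05):
    """Update the filters with one frame and return (on, off) event maps.

    All parameter values are illustrative assumptions, not the paper's.
    """
    on, off = [], []
    for i, p in enumerate(frame):
        # (1) Logarithmic input response.
        x = math.log1p(p)
        # (2) Smoothing at two timescales (fast and slow moving averages).
        state["fast"][i] += alpha_fast * (x - state["fast"][i])
        state["slow"][i] += alpha_slow * (x - state["slow"][i])
        # (3) Bipolar pathways: split the thresholded difference by sign.
        diff = state["fast"][i] - state["slow"][i]
        on.append(max(diff - threshold, 0.0))
        off.append(max(-diff - threshold, 0.0))
    return on, off


# Pixel 0 stays constant, pixel 1 brightens, pixel 2 darkens.
state = init_state([10.0, 10.0, 10.0])
on, off = edr_step([10.0, 100.0, 1.0], state)
```

In this sketch an unchanged pixel produces no event, a brightening pixel produces only an ON response, and a darkening pixel produces only an OFF response. No direction of motion is computed, which is in line with the abstract's stated trade-off of directional information for speed.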

[1]  Stefan Leutenegger,et al.  Simultaneous Optical Flow and Intensity Estimation from an Event Camera , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Tobi Delbruck,et al.  A 240 × 180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision Sensor , 2014, IEEE Journal of Solid-State Circuits.

[3]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[5]  Andreas G. Andreou,et al.  Analog VLSI neuromorphic image acquisition and pre-processing systems , 1995, Neural Networks.

[6]  E. V. Famiglietti,et al.  Structural basis for ON-and OFF-center responses in retinal ganglion cells. , 1976, Science.

[7]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[8]  Yi Zhu,et al.  Hidden Two-Stream Convolutional Networks for Action Recognition , 2017, ACCV.

[9]  Forrest N. Iandola,et al.  SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.

[10]  P. Cavanagh,et al.  Cortical fMRI activation produced by attentive tracking of moving targets. , 1998, Journal of neurophysiology.

[11]  W. Stell,et al.  Structural basis for on-and off-center responses in retinal bipolar cells. , 1977, Science.

[12]  G. Boynton,et al.  Global effects of feature-based attention in human visual cortex , 2002, Nature Neuroscience.

[13]  Stefan Leutenegger,et al.  Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera , 2016, ECCV.

[14]  Davide Scaramuzza,et al.  EMVS: Event-based Multi-View Stereo , 2016, BMVC.

[15]  Javier Sánchez Pérez,et al.  TV-L1 Optical Flow Estimation , 2013, Image Process. Line.

[16]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[17]  Mohit Gupta,et al.  MC3D: Motion Contrast 3D Scanning , 2015, 2015 IEEE International Conference on Computational Photography (ICCP).

[18]  T. Delbruck,et al.  > Replace This Line with Your Paper Identification Number (double-click Here to Edit) < 1 , 2022 .

[19]  H R Blackwell,et al.  Rod and cone receptor mechanisms in typical and atypical congenital achromatopsia , 1961 .

[20]  A. Borst,et al.  Common circuit design in fly and mammalian motion vision , 2015, Nature Neuroscience.

[21]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[22]  Ashok Veeraraghavan,et al.  Direct face detection and video reconstruction from event cameras , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[23]  Kwabena Boahen,et al.  A silicon retina that reproduces signals in the optic nerve , 2006, Journal of neural engineering.

[24]  H. Kolb,et al.  Intracellular staining reveals different levels of stratification for on- and off-center ganglion cells in cat retina. , 1978, Journal of neurophysiology.

[25]  K. Boahen.  Neuromorphic Microchips. , 2005, Scientific American.

[26]  Jürgen Schmidhuber,et al.  Learning to Forget: Continual Prediction with LSTM , 2000, Neural Computation.

[27]  Alexander S. Ecker,et al.  Principles of connectivity among morphologically defined cell types in adult neocortex , 2015, Science.

[28]  Jacob Loveless,et al.  Online algorithms in high-frequency trading , 2013, Commun. ACM.

[29]  Chuang Gan,et al.  End-to-End Learning of Motion Representation for Video Understanding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Albert Wang,et al.  A 180nm CMOS image sensor with on-chip optoelectronic image compression , 2012, Proceedings of the IEEE 2012 Custom Integrated Circuits Conference.

[32]  Surya Ganguli,et al.  Inferring hidden structure in multilayered neural circuits , 2017, bioRxiv.

[33]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[34]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[36]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[37]  M. Tachibana,et al.  A Key Role of Starburst Amacrine Cells in Originating Retinal Directional Selectivity and Optokinetic Eye Movement , 2001, Neuron.

[38]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[39]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[40]  Kwabena Boahen.  Retinomorphic vision systems , 1996, Proceedings of Fifth International Conference on Microelectronics for Neural Networks.

[41]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[42]  David H Brainard,et al.  Simulation of visual perception and learning with a retinal prosthesis , 2018, bioRxiv.

[43]  Bernabé Linares-Barranco,et al.  Mapping from Frame-Driven to Frame-Free Event-Driven Vision Systems by Low-Rate Rate Coding and Coincidence Processing--Application to Feedforward ConvNets , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Thomas Pock,et al.  Real-time panoramic tracking for event cameras , 2017, 2017 IEEE International Conference on Computational Photography (ICCP).

[46]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[47]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[48]  Ivan Laptev,et al.  Efficient Feature Extraction, Encoding, and Classification for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Davide Scaramuzza,et al.  EVO: A Geometric Approach to Event-Based 6-DOF Parallel Tracking and Mapping in Real Time , 2017, IEEE Robotics and Automation Letters.

[50]  Tobi Delbrück,et al.  Retinomorphic Event-Based Vision Sensors: Bioinspired Cameras With Spiking Output , 2014, Proceedings of the IEEE.

[51]  Daniel Matolin,et al.  A QVGA 143 dB Dynamic Range Frame-Free PWM Image Sensor With Lossless Pixel-Level Video Compression and Time-Domain CDS , 2011, IEEE Journal of Solid-State Circuits.

[52]  M. Bethge,et al.  Spikes in Mammalian Bipolar Cells Support Temporal Layering of the Inner Retina , 2013, Current Biology.

[53]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.

[54]  Alexander Borst,et al.  Optogenetic and Pharmacologic Dissection of Feedforward Inhibition in Drosophila Motion Vision , 2014, The Journal of Neuroscience.

[55]  Thomas Brox,et al.  High Accuracy Optical Flow Estimation Based on a Theory for Warping , 2004, ECCV.

[56]  Elsayed E. Hemayed,et al.  Human action recognition using trajectory-based representation , 2015 .

[57]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Jiajun Wu,et al.  Video Enhancement with Task-Oriented Flow , 2018, International Journal of Computer Vision.

[59]  G. Sperling,et al.  The functional architecture of human visual motion perception , 1995, Vision Research.

[60]  Ralph Etienne-Cummings,et al.  Neuromorphic vision sensors , 1996 .

[61]  Michael J. Berry,et al.  Adaptation of retinal processing to image contrast and spatial scale , 1997, Nature.

[62]  Tobi Delbruck,et al.  A 240×180 10mW 12us latency sparse-output vision sensor for mobile applications , 2013, 2013 Symposium on VLSI Circuits.

[63]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Arjun Chandra,et al.  Efficient Parallel Methods for Deep Reinforcement Learning , 2017, ArXiv.

[65]  Davide Scaramuzza,et al.  Event-based, 6-DOF pose tracking for high-speed maneuvers , 2014, 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[66]  John J. Leonard,et al.  Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age , 2016, IEEE Transactions on Robotics.

[67]  W. A. Hagins,et al.  Kinetics of the photocurrent of retinal rods. , 1972, Biophysical journal.

[68]  Max Welling,et al.  Sigma Delta Quantized Networks , 2016, ICLR.

[69]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.