FLGR: Fixed Length Gists Representation Learning for RNN-HMM Hybrid-Based Neuromorphic Continuous Gesture Recognition

A neuromorphic vision sensor is a novel, passive, frameless sensing modality with several advantages over conventional cameras. Frame-based cameras typically run at about 30 fps, which causes motion blur when capturing fast motion such as hand gestures. Rather than wastefully sending entire images at a fixed frame rate, neuromorphic vision sensors transmit only the local pixel-level changes induced by movement in a scene, at the moment they occur. This yields advantageous characteristics, including low energy consumption, high dynamic range, a sparse event stream, and low response latency. In this study, a novel representation learning method is proposed for event-based gesture recognition: Fixed Length Gists Representation (FLGR) learning. Previous methods accumulate events into video frames over a fixed time window (e.g., 30 ms) to form an image-level representation. However, this accumulated-frame-based representation forgoes the event-driven paradigm that makes neuromorphic vision sensors attractive. New representations are needed to fill the gap in non-accumulated-frame-based representations and to further exploit the capabilities of neuromorphic vision. The proposed FLGR is a sequence learned by a mixture density autoencoder and better preserves the nature of event-based data. FLGR has a fixed-length data format and is easy to feed to a sequence classifier. Moreover, an RNN-HMM hybrid is proposed to address the continuous gesture recognition problem. A recurrent neural network (RNN) is applied for FLGR sequence classification, while a hidden Markov model (HMM) is employed to localize candidate gestures and improve the results over a continuous sequence. A neuromorphic continuous hand gesture dataset (Neuro ConGD Dataset) with 17 hand gesture classes was developed for the neuromorphic research community.
Hopefully, FLGR can inspire further study of highly efficient, high-speed, and high-dynamic-range event-based sequence classification tasks.
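
For orientation, the accumulated-frame baseline that the abstract contrasts with FLGR can be sketched as follows. This is a minimal illustration and not the authors' code: the event layout `(x, y, timestamp_us, polarity)`, the function name `accumulate_frames`, and the choice to sum polarities per pixel are all assumptions made for this example. Only the 30 ms window comes from the text.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, timestamp in microseconds, polarity in {-1, +1}).
Event = Tuple[int, int, int, int]


def accumulate_frames(events: List[Event], width: int, height: int,
                      window_us: int = 30_000) -> List[List[List[int]]]:
    """Sum event polarities per pixel inside consecutive fixed time windows."""
    if not events:
        return []
    t0 = min(e[2] for e in events)
    t_last = max(e[2] for e in events)
    n_frames = (t_last - t0) // window_us + 1
    # One height x width grid of zeros per time window.
    frames = [[[0] * width for _ in range(height)] for _ in range(n_frames)]
    for x, y, t, p in events:
        frames[(t - t0) // window_us][y][x] += p
    return frames


# Four toy events on a 2x2 sensor: three in the first 30 ms, one at 95 ms.
events = [(0, 0, 0, 1), (1, 0, 10_000, 1), (1, 1, 29_999, -1), (0, 1, 95_000, 1)]
frames = accumulate_frames(events, width=2, height=2)
```

In this toy stream the second and third windows contain no events, yet the frame-based representation still produces full all-zero frames for them, and the timing of events inside each window is lost. This is the kind of overhead that a non-accumulated representation is meant to avoid.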
