First-Person Action Recognition With Temporal Pooling and Hilbert–Huang Transform

This paper presents a convolutional neural network (CNN)-based approach to first-person action recognition that combines temporal pooling with the Hilbert–Huang transform (HHT). The approach first performs adaptive temporal sub-action localization, treats each channel of the extracted trajectory-pooled CNN features as a time series, and summarizes the temporal dynamics within each sub-action by temporal pooling. The temporal evolution across sub-actions is then modeled by rank pooling. Thereafter, to account for the highly dynamic scene changes in first-person videos, the HHT decomposes the rank-pooled features into a finite, and often small, number of data-dependent functions, called intrinsic mode functions (IMFs), through empirical mode decomposition (EMD). Hilbert spectral analysis is then applied to each IMF, and four salient descriptors are extracted and aggregated into the final video descriptor. This framework not only captures both long- and short-term tendencies precisely, but also mitigates the significant camera motion inherent in first-person videos, yielding better accuracy. Furthermore, it handles complex actions well even with limited training samples. Experiments show that the proposed approach outperforms the main state-of-the-art methods on four publicly available first-person video datasets.
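The EMD/Hilbert pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the sifting-stop criteria are simplified, and the four descriptors shown (mean and standard deviation of instantaneous amplitude and frequency) are a plausible placeholder, since the paper's exact descriptor choice is not specified here.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import hilbert

def sift(signal, max_siftings=10):
    """Extract one intrinsic mode function (IMF) by iterative sifting."""
    h = signal.copy()
    for _ in range(max_siftings):
        # Locate local maxima and minima of the current component.
        maxima = np.where((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:]))[0] + 1
        minima = np.where((h[1:-1] < h[:-2]) & (h[1:-1] < h[2:]))[0] + 1
        if len(maxima) < 4 or len(minima) < 4:
            break  # too few extrema to build cubic-spline envelopes
        t = np.arange(len(h))
        upper = CubicSpline(maxima, h[maxima])(t)
        lower = CubicSpline(minima, h[minima])(t)
        h = h - (upper + lower) / 2.0  # subtract the local envelope mean
    return h

def emd(signal, max_imfs=4):
    """Decompose a 1-D signal into IMFs plus a residual trend."""
    imfs, residual = [], signal.astype(float)
    for _ in range(max_imfs):
        imf = sift(residual)
        imfs.append(imf)
        residual = residual - imf
        # Stop when the residual is monotonic, i.e. no oscillations remain.
        if np.all(np.diff(residual) >= 0) or np.all(np.diff(residual) <= 0):
            break
    return imfs, residual

def hilbert_descriptors(imf, fs=1.0):
    """Summarize one IMF via its Hilbert spectrum.

    The four statistics returned here are illustrative, not the
    paper's exact descriptor set.
    """
    analytic = hilbert(imf)
    amp = np.abs(analytic)                      # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic))       # instantaneous phase
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)
    return np.array([amp.mean(), amp.std(), inst_freq.mean(), inst_freq.std()])
```

By construction, the IMFs and residual sum back to the input signal exactly, so each pooled CNN feature channel can be decomposed losslessly and then summarized per IMF before aggregation into the video descriptor.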
