Temporal aggregation for first-person action recognition using Hilbert-Huang transform

This paper presents a new approach to action recognition in first-person videos that aggregates both short- and long-term trends based on the coefficients of the Hilbert-Huang transform (HHT), a well-established time-frequency analysis tool. In contrast to previous work such as Pooled Time Series (PoT), the proposed scheme extracts salient activity features through non-stationary HHT analysis, which consists of empirical mode decomposition (EMD) followed by Hilbert spectral analysis, and can be combined with convolutional neural network (CNN) features such as trajectory-pooled CNN descriptors to achieve superior recognition accuracy. Experiments show that the proposed method outperforms leading state-of-the-art approaches on two widely used public first-person datasets.
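As a concrete illustration of the aggregation sketched above, the following minimal Python sketch decomposes each dimension of a per-frame CNN feature time series with empirical mode decomposition and applies Hilbert spectral analysis to every intrinsic mode function (IMF). The `PyEMD` package, the `hht_descriptor`/`aggregate_clip` names, and the specific pooled statistics are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of HHT-based temporal aggregation (illustrative, not the
# paper's exact method): decompose each feature trajectory into IMFs via EMD,
# then pool instantaneous amplitude/frequency statistics from the Hilbert
# spectral analysis of each IMF into a fixed-length clip descriptor.
import numpy as np
from scipy.signal import hilbert
from PyEMD import EMD  # assumed dependency: pip install EMD-signal

def hht_descriptor(series, max_imfs=4):
    """Pool one 1-D feature trajectory (length T) into fixed HHT statistics."""
    imfs = EMD()(series.astype(float), max_imf=max_imfs)  # rows are IMFs, coarse modes last
    feats = []
    for imf in imfs[:max_imfs]:
        analytic = hilbert(imf)                     # analytic signal via Hilbert transform
        amp = np.abs(analytic)                      # instantaneous amplitude
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) / (2.0 * np.pi)  # instantaneous frequency (cycles/frame)
        # Each IMF isolates one time scale: fast IMFs summarize short-term
        # variation, while slower IMFs carry the long-term trend.
        feats.extend([amp.mean(), amp.std(),
                      inst_freq.mean(), inst_freq.std()])
    feats += [0.0] * (4 * max_imfs - len(feats))    # pad if fewer IMFs emerged
    return np.asarray(feats)

def aggregate_clip(frame_features, max_imfs=4):
    """frame_features: (T, D) per-frame CNN features -> (D * 4 * max_imfs,) vector."""
    return np.concatenate([hht_descriptor(frame_features[:, d], max_imfs)
                           for d in range(frame_features.shape[1])])
```

Concatenating such per-dimension HHT statistics with, e.g., trajectory-pooled CNN features before a linear classifier mirrors the aggregation described in the abstract; the per-IMF statistics chosen here (mean and standard deviation of amplitude and frequency) are one plausible pooling, not the paper's exact coefficients.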
