Deep metric learning for open-set human action recognition in videos

Human action recognition (HAR) is a widely studied topic in computer vision and pattern recognition. Despite the success of recent models for this task, most of them approach HAR from a closed-set perspective. Closed-set recognition assumes that all classes are known a priori and that they appear in both the training and test phases. Unlike most previous works, we approach HAR from the open-set perspective, in which the model must also handle previously unknown classes. Feature extraction for open-set HAR remains underexplored in the recent literature, and it is demanding: known classes must be represented with low intra-class variance so that unknown examples can be rejected. To this end, we propose a deep metric learning model named triplet inflated 3D convolutional neural network (TI3D), which builds upon the well-known I3D model. TI3D is a representation learning model that takes video sequences as input and outputs 256-dimensional representations. We perform extensive experiments and statistical comparisons on the UCF-101 dataset using a 30-fold cross-validation procedure in 25 different scenarios with varying degrees of openness and varying numbers of training and test classes. Results reveal that the proposed TI3D achieves better performance than non-metric learning models in terms of $F_1$ score and Youden's index, indicating a promising approach for open-set video action recognition.
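As an illustration of the two quantities named above, the following minimal sketch computes a triplet loss on embedding vectors and Youden's index from a confusion matrix. The abstract does not give the exact loss formulation, so the squared Euclidean distance and the margin value of 0.2 are assumptions borrowed from common triplet-loss practice, and the function names are hypothetical. Youden's index follows its standard definition, $J = \text{sensitivity} + \text{specificity} - 1$.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed form, not confirmed by the abstract).

    Pulls the anchor towards a same-class example (positive) and pushes it
    away from a different-class example (negative) by at least `margin`.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def youden_index(tp, fn, tn, fp):
    """Youden's J statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; TI3D itself outputs 256 dimensions.
    anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0]
    print(triplet_loss(anchor, positive, negative))
    print(youden_index(tp=8, fn=2, tn=9, fp=1))
```

A loss of zero means the negative example is already farther from the anchor than the positive one by at least the margin. This is the property that keeps intra-class variance low and makes distance-based rejection of unknown classes feasible.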
