Toward human-centric deep video understanding

People are the very heart of our daily work and life. As we strive to leverage artificial intelligence to empower every person on the planet to achieve more, we need to understand people far better than we can today. Human–computer interaction plays a significant role in human-machine hybrid intelligence, and human understanding becomes a critical step in addressing the tremendous challenges of video understanding. In this paper, we share our views on why and how to use a human centric approach to address the challenging video understanding problems. We discuss human-centric vision tasks and their status, highlighting the challenges and how our understanding of human brain functions can be leveraged to effectively address some of the challenges. We show that semantic models, view-invariant models, and spatial-temporal visual attention mechanisms are important building blocks. We also discuss the future perspectives of video understanding.

[1]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[2]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Xiaogang Wang,et al.  End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Xiaogang Wang,et al.  Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Tao Mei,et al.  Part-Aligned Bilinear Representations for Person Re-identification , 2018, ECCV.

[7]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[8]  Pascal Fua,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 Multiple Object Tracking Using K-shortest Paths Optimization , 2022 .

[9]  Abhinav Gupta,et al.  Zero-Shot Recognition via Semantic Embeddings and Knowledge Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Kaiqi Huang,et al.  Learning Deep Context-Aware Features over Body and Latent Parts for Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Dong Liu,et al.  Human Pose Estimation Using Global and Local Normalization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[13]  Wei Wu,et al.  High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[15]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[16]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[17]  D. Ripley,et al.  Vision disturbance after TBI. , 2010, NeuroRehabilitation.

[18]  Luca Bertinetto,et al.  Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[19]  Eric P. Xing,et al.  Symbolic Graph Reasoning Meets Convolutions , 2018, NeurIPS.

[20]  Nanning Zheng,et al.  Person Re-identification by Multi-Channel Parts-Based CNN with Improved Triplet Loss Function , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Wenjun Zeng,et al.  Cross View Fusion for 3D Human Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Nanning Zheng,et al.  View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Dariu Gavrila,et al.  A Bayesian Framework for Multi-cue 3D Object Tracking , 2004, ECCV.

[24]  Pascal Fua,et al.  Multicamera People Tracking with a Probabilistic Occupancy Map , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Wenjun Zeng,et al.  Densely Semantically Aligned Person Re-Identification , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[30]  Bernt Schiele,et al.  Multi-view Pictorial Structures for 3D Human Pose Estimation , 2013, BMVC.

[31]  Cordelia Schmid,et al.  VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[33]  Yichen Wei,et al.  Compositional Human Pose Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[36]  Yi Wu,et al.  Online Object Tracking: A Benchmark , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  A. G. Amitha Perera,et al.  Multi-Object Tracking Through Simultaneous Long Occlusions and Split-Merge Conditions , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[38]  Lourdes Agapito,et al.  Rethinking Pose in 3D: Multi-stage Refinement and Recovery for Markerless Motion Capture , 2018, 2018 International Conference on 3D Vision (3DV).

[39]  Wenjun Zeng,et al.  Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition , 2018, 2018 IEEE International Conference on Multimedia and Expo (ICME).

[40]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[41]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  Carlo Tomasi,et al.  Tracking Multiple People Online and in Real Time , 2014, ACCV.

[43]  Xiaoyan Sun,et al.  MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[45]  Xiong Chen,et al.  Learning Discriminative Features with Multiple Granularities for Person Re-Identification , 2018, ACM Multimedia.

[46]  Bernt Schiele,et al.  Multiple People Tracking by Lifted Multicut and Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Xin Jin,et al.  Semantics-Aligned Representation Learning for Person Re-identification , 2019, AAAI.

[48]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  G. Johansson Visual perception of biological motion and a model for its analysis , 1973 .

[50]  Bernt Schiele,et al.  Subgraph decomposition for multi-target tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Bernt Schiele,et al.  Learning People Detectors for Tracking in Crowded Scenes , 2013, 2013 IEEE International Conference on Computer Vision.

[52]  Chong Luo,et al.  A Twofold Siamese Network for Real-Time Object Tracking , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Zhiwei Xiong,et al.  SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Xiaogang Wang,et al.  Intelligent multi-camera video surveillance: A review , 2013, Pattern Recognit. Lett..

[55]  Gang Wang,et al.  A Siamese Long Short-Term Memory Architecture for Human Re-identification , 2016, ECCV.

[56]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Richard P. Wildes,et al.  Spatiotemporal Multiplier Networks for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[60]  Shiliang Zhang,et al.  Pose-Driven Deep Convolutional Model for Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[61]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[62]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.