Joint Analysis and Prediction of Human Actions and Paths in Video

With the advancement in computer vision deep learning, systems now are able to analyze an unprecedented amount of rich visual information from videos to enable applications such as autonomous driving, socially-aware robot assistant and public safety monitoring. Deciphering human behaviors to predict their future paths/trajectories and what they would do from videos is important in these applications. However, modern vision systems in self-driving applications usually perform the detection (perception) and prediction task in separate components, which leads to error propagation and sub-optimal performance. More importantly, these systems do not provide high-level semantic attributes to reason about pedestrian future. This design hinders prediction performance in video data from diverse domains and unseen scenarios. To enable optimal future human behavioral forecasting, it is crucial for the system to be able to detect and analyze human activities leading up to the prediction period, passing informative features to the subsequent prediction module for context understanding. In this thesis, with the goal of improving the performance and generalization ability of future trajectory and action prediction models, we conduct human action analysis and jointly optimize models for action detection, prediction and trajectory prediction.

[1]  Chen Sun,et al.  Stochastic Prediction of Multi-Agent Interactions from Partial Observations , 2019, ICLR.

[2]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[3]  Alan L. Yuille,et al.  Feature Denoising for Improving Adversarial Robustness , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Silvio Savarese,et al.  Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Jonathan G. Fiscus,et al.  TRECVID 2019: An evaluation campaign to benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & retrieval , 2019, TRECVID.

[6]  Yunbo Wang,et al.  Eidetic 3D LSTM: A Model for Video Prediction and Beyond , 2019, ICLR.

[7]  Sergio Casas,et al.  PnPNet: End-to-End Perception and Prediction With Tracking in the Loop , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[9]  Sergey Levine,et al.  PRECOG: PREdiction Conditioned on Goals in Visual Multi-Agent Settings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[11]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Yoichi Sato,et al.  Future Person Localization in First-Person Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Xiaogang Wang,et al.  Learning from massive noisy labeled data for image classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Li Fei-Fei,et al.  MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels , 2017, ICML.

[15]  Jonathan G. Fiscus,et al.  TRECVID 2018: Benchmarking Video Activity Detection, Video Captioning and Matching, Video Storytelling Linking and Video Search , 2018, TRECVID.

[16]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[17]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[18]  Daphne Koller,et al.  Self-Paced Learning for Latent Variable Models , 2010, NIPS.

[19]  Jiande Sun,et al.  Informedia @ TRECVID 2016 , 2016, TRECVID.

[20]  Juan Carlos Niebles,et al.  Peeking Into the Future: Predicting Future Person Activities and Locations in Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Nicholas Rhinehart,et al.  First-Person Activity Forecasting with Online Inverse Reinforcement Learning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Cordelia Schmid,et al.  Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos , 2018, ArXiv.

[24]  Philip H. S. Torr,et al.  DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Mayank Bansal,et al.  ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst , 2018, Robotics: Science and Systems.

[26]  Xinlei Chen,et al.  NEIL: Extracting Visual Knowledge from Web Data , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Song-Chun Zhu,et al.  Learning and Inferring “Dark Matter” and Predicting Human Intents and Trajectories in Videos , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[29]  Stefan Lee,et al.  Embodied Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[30]  David A. Shamma,et al.  The New Data and New Challenges in Multimedia Research , 2015, ArXiv.

[31]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[32]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Trevor Darrell,et al.  Adversarial Discriminative Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[35]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[36]  Yi Yang,et al.  Contrastive Adaptation Network for Unsupervised Domain Adaptation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Bernhard Schölkopf,et al.  Unifying distillation and privileged information , 2015, ICLR.

[38]  Bin Yang,et al.  Learning to Reweight Examples for Robust Deep Learning , 2018, ICML.

[39]  Joan Bruna,et al.  Training Convolutional Networks with Noisy Labels , 2014, ICLR 2014.

[40]  Hongyi Zhang,et al.  mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[41]  Sergio Casas,et al.  End-To-End Interpretable Neural Motion Planner , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Deyu Meng,et al.  Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search , 2014, ACM Multimedia.

[43]  Katta G. Murty,et al.  Nonlinear Programming Theory and Algorithms , 2007, Technometrics.

[44]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Qi Tian,et al.  Data-Free Learning of Student Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Ali Farhadi,et al.  Learning Everything about Anything: Webly-Supervised Visual Concept Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[48]  Hema Swetha Koppula,et al.  Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Silvio Savarese,et al.  Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Ming-Hsuan Yang,et al.  Object Tracking Benchmark , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Juan Carlos Niebles,et al.  Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[52]  Afzal Godil,et al.  Summary of the 2019 Activity Detection in Extended Videos Prize Challenge , 2020, 2020 IEEE Winter Applications of Computer Vision Workshops (WACVW).

[53]  Lu Jiang,et al.  SimAug: Learning Robust Representations from Simulation for Trajectory Prediction , 2020, ECCV.

[54]  Kai Oliver Arras,et al.  People tracking with human motion predictions from social forces , 2010, 2010 IEEE International Conference on Robotics and Automation.

[55]  Valentin I. Spitkovsky,et al.  Baby Steps: How “Less is More” in Unsupervised Dependency Parsing , 2009 .

[56]  Sanja Fidler,et al.  Meta-Sim: Learning to Generate Synthetic Datasets , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[58]  Stefano Soatto,et al.  Intent-aware long-term prediction of pedestrian motion , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[59]  Arun Ross,et al.  Forecasting Pedestrian Trajectory with Machine-Annotated Training Data , 2019, 2019 IEEE Intelligent Vehicles Symposium (IV).

[60]  Henggang Cui,et al.  Uncertainty-aware Short-term Motion Prediction of Traffic Actors for Autonomous Driving , 2018, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[61]  Yi Yang,et al.  Fast and Accurate Content-based Semantic Search in 100M Internet Videos , 2015, ACM Multimedia.

[62]  Deyu Meng,et al.  Leveraging Multi-modal Prior Knowledge for Large-scale Concept Learning in Noisy Web Data , 2017, ICMR.

[63]  Rui Caseiro,et al.  High-Speed Tracking with Kernelized Correlation Filters , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Xiaogang Wang,et al.  Pedestrian Behavior Understanding and Prediction with Deep Neural Networks , 2016, ECCV.

[65]  Silvio Savarese,et al.  Deep Learning Under Privileged Information Using Heteroscedastic Dropout , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  Simon Lucey,et al.  Argoverse: 3D Tracking and Forecasting With Rich Maps , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Kris M. Kitani,et al.  Forecasting Interactive Dynamics of Pedestrians with Fictitious Play , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Luc Van Gool,et al.  Improving Data Association by Joint Modeling of Pedestrian Trajectories and Groupings , 2010, ECCV.

[69]  Ashish Kapoor,et al.  AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles , 2017, FSR.

[70]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[71]  Silvio Savarese,et al.  SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Marc'Aurelio Ranzato,et al.  Sequence Level Training with Recurrent Neural Networks , 2015, ICLR.

[73]  Larry S. Davis,et al.  AVSS 2011 demo session: A large-scale benchmark dataset for event recognition in surveillance video , 2011, AVSS.

[74]  Antonio Manuel López Peña,et al.  Procedural Generation of Videos to Train Deep Action Recognition Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[76]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[77]  Luca Anthony Thiede,et al.  Analyzing the Variety Loss in the Context of Probabilistic Trajectory Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[78]  Qin Jin,et al.  Generating Natural Video Descriptions via Multimodal Processing , 2016, INTERSPEECH.

[79]  Alexander Hauptmann,et al.  The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[80]  Fei-Fei Li,et al.  Shifting Weights: Adapting Object Detectors from Image to Video , 2012, NIPS.

[81]  Qin Jin,et al.  Video Description Generation using Audio and Visual Cues , 2016, ICMR.

[82]  Yi Zhang,et al.  UnrealCV: Virtual Worlds for Computer Vision , 2017, ACM Multimedia.

[83]  Dietrich Paulus,et al.  Simple online and realtime tracking with a deep association metric , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[84]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Silvio Savarese,et al.  Learning Social Etiquette: Human Trajectory Understanding In Crowded Scenes , 2016, ECCV.

[86]  Slawomir Bak,et al.  Domain Adaptation through Synthesis for Unsupervised Person Re-identification , 2018, ECCV.

[87]  Julien Pettré,et al.  Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories With GANs , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[88]  Larry S. Davis,et al.  Temporal Context Network for Activity Localization in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[89]  Yichen Wei,et al.  Relation Networks for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[90]  Daphne Koller,et al.  Learning specific-class segmentation from diverse data , 2011, 2011 International Conference on Computer Vision.

[91]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[92]  François Laviolette,et al.  Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[93]  Andrew Zisserman,et al.  The AVA-Kinetics Localized Human Actions Video Dataset , 2020, ArXiv.

[94]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[95]  Tanaya Guha,et al.  Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[96]  Bolei Zhou,et al.  Scene Parsing through ADE20K Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[97]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[98]  Ali Farhadi,et al.  Target-driven visual navigation in indoor scenes using deep reinforcement learning , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[99]  Liangliang Cao,et al.  Focal Visual-Text Attention for Visual Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[100]  Shenghua Gao,et al.  Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[101]  Alexander G. Hauptmann,et al.  Informedia @ TRECVID 2017 , 2017, TRECVID.

[102]  Paul Vernaza,et al.  r2p2: A ReparameteRized Pushforward Policy for Diverse, Precise Generative Path Forecasting , 2018, ECCV.

[103]  Yi Yang,et al.  Revisiting EmbodiedQA: A Simple Baseline and Beyond , 2019, IEEE Transactions on Image Processing.

[104]  Mark Reynolds,et al.  SS-LSTM: A Hierarchical LSTM Model for Pedestrian Trajectory Prediction , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[105]  Deva Ramanan,et al.  Self-Paced Learning for Long-Term Tracking , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[106]  Yuke Li,et al.  Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[107]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[108]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[109]  Benjamin Sapp,et al.  MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction , 2019, CoRL.

[110]  Gregory D. Hager,et al.  RSA: Randomized Simulation as Augmentation for Robust Human Action Recognition , 2019, ArXiv.

[111]  Hema Swetha Koppula,et al.  Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[112]  Dan Boneh,et al.  Ensemble Adversarial Training: Attacks and Defenses , 2017, ICLR.

[113]  Ivor W. Tsang,et al.  Visual Event Recognition in Videos by Learning from Web Data , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[114]  Di Huang,et al.  Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels , 2020, ICML.

[115]  Juan Carlos Niebles,et al.  Graph Distillation for Action Detection with Privileged Modalities , 2017, ECCV.

[116]  Ying Nian Wu,et al.  Multi-Agent Tensor Fusion for Contextual Trajectory Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[117]  Kathrin Klamroth,et al.  Biconvex sets and optimization with biconvex functions: a survey and extensions , 2007, Math. Methods Oper. Res..

[118]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[119]  Song-Chun Zhu,et al.  Joint inference of groups, events and human roles in aerial videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[120]  Aleksander Madry,et al.  Towards Deep Learning Models Resistant to Adversarial Attacks , 2017, ICLR.

[121]  Bin Yang,et al.  Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[122]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[123]  John K. Tsotsos,et al.  PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[124]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[125]  Silvio Savarese,et al.  Single-source Attention Path Prediction Multi-source Attention Predicted Observed , 2018 .

[126]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[127]  Xinlei Chen,et al.  Never-Ending Learning , 2012, ECAI.

[128]  Jonathan P. How,et al.  A Transferable Pedestrian Motion Prediction Model for Intersections with Different Geometries , 2018, ArXiv.

[129]  Alexander G. Hauptmann,et al.  Temporal localization of audio events for conflict monitoring in social media , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[130]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[131]  Xinlei Chen,et al.  Webly Supervised Learning of Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[132]  Isaac L. Chuang,et al.  Confident Learning: Estimating Uncertainty in Dataset Labels , 2019, J. Artif. Intell. Res..

[133]  Deyu Meng,et al.  What Objective Does Self-paced Learning Indeed Optimize? , 2015, ArXiv.

[134]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[135]  Silvio Savarese,et al.  Understanding Collective Activitiesof People from Videos , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[136]  Stan Sclaroff,et al.  Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[137]  Mohan M. Trivedi,et al.  Trajectory Forecasts in Unknown Environments Conditioned on Grid-Based Plans , 2020, ArXiv.

[138]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[139]  Dariu Gavrila,et al.  Context-Based Pedestrian Path Prediction , 2014, ECCV.

[140]  Chenxi Liu,et al.  Adversarial Attacks Beyond the Image Space , 2017, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[141]  Yong Cheng,et al.  Robust Neural Machine Translation with Doubly Adversarial Inputs , 2019, ACL.

[142]  Benjamin Sapp,et al.  Rules of the Road: Predicting Driving Behavior With a Convolutional Model of Semantic Interactions , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[143]  Rui Hou,et al.  Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[144]  Liangliang Cao,et al.  Focal Visual-Text Attention for Visual Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[145]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[146]  Alexander G. Hauptmann,et al.  Webly-Supervised Learning of Multimodal Video Detectors , 2017, AAAI.

[147]  Cordelia Schmid,et al.  Synthetic Humans for Action Recognition from Unseen Viewpoints , 2019, ArXiv.

[148]  Rauf Izmailov,et al.  Learning using privileged information: similarity control and knowledge transfer , 2015, J. Mach. Learn. Res..

[149]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[150]  Peng Chen,et al.  Argus: Efficient Activity Detection System for Extended Video Analysis , 2020, 2020 IEEE Winter Applications of Computer Vision Workshops (WACVW).

[151]  R. Smith,et al.  An Overview of the Tesseract OCR Engine , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[152]  Thomas Brox,et al.  Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[153]  Daniel Jurafsky,et al.  A Simple, Fast Diverse Decoding Algorithm for Neural Generation , 2016, ArXiv.

[154]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[155]  Graham M. Gibson,et al.  A fast 3D reconstruction system with a low-cost camera accessory , 2015, Scientific Reports.

[156]  Dani Lischinski,et al.  Crowds by Example , 2007, Comput. Graph. Forum.

[157]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[158]  Larry S. Davis,et al.  Fast Automatic Video Retrieval using Web Images , 2015, ArXiv.

[159]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[160]  Deyu Meng,et al.  Learning to Detect Concepts from Webly-Labeled Video Data , 2016, IJCAI.

[161]  Kris Kitani,et al.  End-to-End 3D Multi-Object Tracking and Trajectory Forecasting , 2020, ArXiv.

[162]  Yuval Tassa,et al.  Emergence of Locomotion Behaviours in Rich Environments , 2017, ArXiv.

[163]  Jun Zhu,et al.  Understanding Human Behaviors in Crowds by Imitating the Decision-Making Process , 2018, AAAI.

[164]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[165]  Samy Bengio,et al.  Adversarial examples in the physical world , 2016, ICLR.

[166]  Nanning Zheng,et al.  SR-LSTM: State Refinement for LSTM Towards Pedestrian Trajectory Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[167]  John K. Tsotsos,et al.  Are They Going to Cross? A Benchmark Dataset and Baseline for Pedestrian Crosswalk Behavior , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[168]  Song-Chun Zhu,et al.  CERN: Confidence-Energy Recurrent Network for Group Activity Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[169]  Lars Petersson,et al.  Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[170]  Ruslan Salakhutdinov,et al.  Multiple Futures Prediction , 2019, NeurIPS.

[171]  Stefan Roth,et al.  Neural Nearest Neighbors Networks , 2018, NeurIPS.

[172]  Yee Whye Teh,et al.  Stacked Capsule Autoencoders , 2019, NeurIPS.

[173]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[174]  Dariu M. Gavrila,et al.  Human motion trajectory prediction: a survey , 2019, Int. J. Robotics Res..

[175]  Shih-Fu Chang,et al.  Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[176]  Gita Alaghband,et al.  Scene-LSTM: A Model for Human Trajectory Prediction , 2018, ArXiv.

[177]  Dumitru Erhan,et al.  Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[178]  Vladlen Koltun,et al.  Playing for Data: Ground Truth from Computer Games , 2016, ECCV.

[179]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[180]  Xin Pan,et al.  YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[181]  Shiguang Shan,et al.  Self-Paced Learning with Diversity , 2014, NIPS.

[182]  Sergio Casas,et al.  IntentNet: Learning to Predict Intention from Raw Sensor Data , 2018, CoRL.

[183]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[184]  Francis R. Bach,et al.  Online Learning for Latent Dirichlet Allocation , 2010, NIPS.

[185]  Manmohan Krishna Chandraker,et al.  Learning To Simulate , 2018, ICLR.

[186]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[187]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[188]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[189]  Alexander Hauptmann,et al.  MMVG-INF-Etrol@TRECVID 2019: Activities in Extended Video , 2019, TRECVID.

[190]  Xirong Li,et al.  Semantic Concept Annotation of Consumer Videos at Frame-Level Using Audio , 2014, PCM.

[191]  Yunchao Wei,et al.  Towards Computational Baby Learning: A Weakly-Supervised Approach for Object Detection , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[192]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[193]  T. Başar,et al.  A New Approach to Linear Filtering and Prediction Problems , 2001 .

[194]  Jun-Cheng Chen,et al.  A Proposal-Based Solution to Spatio-Temporal Action Detection in Untrimmed Videos , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[195]  Jia Chen,et al.  An Event Reconstruction Tool for Conflict Monitoring Using Social Media , 2017, AAAI.