Event Detection in Continuous Video: An Inference in Point Process Approach

We propose a novel approach to event detection in real-world continuous video sequences. The method 1) models arbitrary-order non-Markovian dependences in videos to mitigate local visual ambiguities; 2) performs event segmentation and labeling simultaneously; and 3) is time-window free. The idea is to represent a video as an event stream containing both high-level semantic events and low-level video observations. In training, we learn a point process model called a piecewise-constant conditional intensity model (PCIM), which can capture complex non-Markovian dependences in the event streams. In testing, event detection is cast as inference of the high-level semantic events given the low-level image observations. We develop the first inference algorithm for PCIM and show that it samples exactly from the posterior distribution. We then evaluate the method on video event detection in real-world video sequences. Our model not only gives competitive results on the video event segmentation and labeling task, but is also interpretable and efficient.
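To make the notion of a piecewise-constant conditional intensity concrete, the following is a minimal sketch and not the paper's model or inference algorithm. It assumes a toy setting in which the rate of a high-level label switches between two constant values depending on whether a low-level observation occurred within a fixed look-back window. The function names, the labels "obs" and "evt", and the parameters `window`, `rate_idle`, and `rate_active` are all hypothetical, chosen only for illustration. Under that assumption, the log-likelihood of the high-level events is the sum over states of count × log(rate) − rate × duration.

```python
import math


def is_active(t, trigger_times, window):
    """Toy history test: True if some trigger event s satisfies s < t <= s + window."""
    return any(s < t <= s + window for s in trigger_times)


def target_loglik(events, horizon, target, trigger, window, rate_idle, rate_active):
    """Log-likelihood of `target` events on [0, horizon] under a two-state
    piecewise-constant intensity (an illustrative assumption, not the paper's model).

    events: list of (time, label) pairs.
    The intensity of `target` is `rate_active` while the history test holds
    and `rate_idle` otherwise.
    """
    trigger_times = [t for t, lab in events if lab == trigger]
    target_times = [t for t, lab in events if lab == target]

    # The state can only change at a trigger time or when its window expires.
    cuts = {0.0, float(horizon)}
    for s in trigger_times:
        cuts.add(min(s, horizon))
        cuts.add(min(s + window, horizon))
    cuts = sorted(cuts)

    # Total time spent in each state; the state is constant inside a segment,
    # so testing the midpoint is enough.
    duration = {False: 0.0, True: 0.0}
    for a, b in zip(cuts, cuts[1:]):
        mid = 0.5 * (a + b)
        duration[is_active(mid, trigger_times, window)] += b - a

    # Number of target events that occurred in each state.
    count = {False: 0, True: 0}
    for t in target_times:
        count[is_active(t, trigger_times, window)] += 1

    rate = {False: rate_idle, True: rate_active}
    return sum(count[s] * math.log(rate[s]) - rate[s] * duration[s]
               for s in (False, True))


if __name__ == "__main__":
    stream = [(1.0, "obs"), (1.5, "evt"), (4.0, "evt")]
    print(target_loglik(stream, 5.0, "evt", "obs", 1.0, 0.1, 2.0))
```

In this sketch the history of the stream enters only through the windowed test, which is what allows the intensity to depend on events arbitrarily far back without a Markov assumption; a learned model would replace that single test with whatever history-dependent states it has fitted.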
