Online Human Interaction Detection and Recognition With Multiple Cameras

We address the problem of detecting and recognizing human interactions online, as seen by a network of multiple cameras. We represent each interaction as a temporal trajectory that couples the body motion of each individual with their proximity relationships to others and, whenever available, with sound. These trajectories are modeled with kernel state-space (KSS) models, which are well suited to online interaction detection and recognition and to fusing information from multiple cameras, while enabling a fast implementation based on online recursive updates. For recognition, we design so-called pairwise kernels with a special symmetry in order to compare interaction trajectories in the space of KSS models. For detection, we exploit the geometry of linear operators in Hilbert space and extend to KSS models the concept of parity space, originally defined for linear models. For fusion, we combine KSS models with kernel construction and multiview learning techniques. We extensively evaluate the approach on four publicly available single-view data sets, and we also introduce, and will make public, a new challenging human interaction data set that we collected using a network of three cameras. The results show that the approach holds promise as an effective building block for the real-time analysis of human behavior from multiple cameras.
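To make the idea of a pairwise kernel with a symmetry concrete, the following is a minimal sketch and not the construction used in this work. It assumes a common symmetrized tensor-product form, K((a, b), (c, d)) = k(a, c) k(b, d) + k(a, d) k(b, c), and it uses a Gaussian kernel on plain feature vectors as a stand-in for a kernel between KSS models. The names `rbf` and `pairwise_kernel`, the `gamma` parameter, and the sample descriptors are all hypothetical.

```python
import math


def rbf(x, y, gamma=0.5):
    """Gaussian base kernel between two equal-length feature vectors.

    Assumption: this stands in for a kernel between two KSS models.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def pairwise_kernel(pair1, pair2, base=rbf):
    """Symmetrized tensor-product kernel between two pairs of individuals.

    K((a, b), (c, d)) = k(a, c) * k(b, d) + k(a, d) * k(b, c)
    Swapping the two members of either pair leaves the value unchanged.
    """
    a, b = pair1
    c, d = pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)


# Hypothetical per-person descriptors for two interactions.
p1, p2 = [0.0, 1.0], [1.0, 0.5]
q1, q2 = [0.2, 0.9], [0.8, 0.4]

k_same = pairwise_kernel((p1, p2), (q1, q2))
k_swapped = pairwise_kernel((p2, p1), (q1, q2))
```

Under these assumptions, `k_same` and `k_swapped` agree, so the similarity between two interactions does not depend on which participant is listed first.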
