Towards accurate group activity analysis in videos: robust saliency detection and effective feature modeling

Human activity analysis is an important area of computer vision research. Its goal is to automatically analyze ongoing activities in a previously unseen video. The ability to analyze complex human activities from video has many important applications, such as smart camera systems and video surveillance. However, the field is still far from delivering an off-the-shelf system: many challenging problems remain open, and it is an active research area. This dissertation addresses two of these problems: handling various camera motions and effectively modeling group behaviors.

First, we propose a unified and robust framework to detect salient motions in diverse types of videos. Given a video sequence recorded by either a stationary or a moving camera, our algorithm detects the salient motion regions. The model is inspired by two observations: 1) background motion caused by orthographic cameras lies in a low-rank subspace, and 2) pixels belonging to one trajectory tend to group together. Based on these two observations, we introduce a new model that uses both low-rank and group sparsity constraints. It robustly decomposes a motion trajectory matrix into foreground and background components. Extensive experiments demonstrate very competitive performance on both synthetic data and real videos.

Second, after salient motion detection, we propose a new method to model group behaviors in video sequences. This approach models group activities based on social behavior analysis. Unlike previous work that uses independent local features, our method explores the relationship between the current behavior state of a subject and its actions. An interaction energy potential function is proposed to represent the current behavior state of a subject, and the subject's velocity is used to represent its actions. Our method does not depend on human detection, so it is robust to detection errors. Instead, tracked salient points provide a good basis for modeling group interactions. We evaluate our algorithm on two datasets, UMN and BEHAVE. Experimental results show promising performance compared with state-of-the-art methods.
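
The low-rank plus group-sparsity decomposition mentioned in the abstract can be illustrated with a small sketch. It is not the dissertation's solver; it is a minimal alternating scheme under several assumptions that the abstract does not state: each row of the trajectory matrix holds one tracked point's coordinates over all frames, the background rank is bounded by 3, the number of foreground trajectories is known in advance, and the group-sparse step is a hard threshold that keeps whole rows. The function names `decompose_trajectories` and `synthetic_trajectories` are hypothetical.

```python
import numpy as np


def decompose_trajectories(T, rank=3, n_foreground=1, n_iter=50):
    """Split a trajectory matrix T into a low-rank part B and a row-sparse part F.

    Assumptions (illustrative, not taken from the source):
    - each row of T is one trajectory, so a "group" is a whole row;
    - background motion has rank <= `rank`;
    - at most `n_foreground` trajectories belong to the foreground;
    - n_iter >= 1.
    """
    T = np.asarray(T, dtype=float)
    n_rows = T.shape[0]
    F = np.zeros_like(T)
    for _ in range(n_iter):
        # Low-rank step: best rank-`rank` approximation of T - F (truncated SVD).
        U, s, Vt = np.linalg.svd(T - F, full_matrices=False)
        B = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Group-sparse step: keep only the rows with the largest residual norm,
        # so a trajectory is assigned to the foreground as a whole.
        R = T - B
        norms = np.linalg.norm(R, axis=1)
        keep = np.argsort(norms)[n_rows - n_foreground:]
        F = np.zeros_like(T)
        F[keep] = R[keep]
    return B, F, np.sort(keep)


def synthetic_trajectories(n_background=40, n_foreground=3, n_cols=20, seed=0):
    """Toy data: rank-3 background rows, plus extra independent motion
    added to the last `n_foreground` rows."""
    rng = np.random.default_rng(seed)
    n_rows = n_background + n_foreground
    T = rng.standard_normal((n_rows, 3)) @ rng.standard_normal((3, n_cols))
    T[n_background:] += rng.standard_normal((n_foreground, n_cols))
    return T
```

Calling `decompose_trajectories(synthetic_trajectories(), rank=3, n_foreground=3)` returns the background estimate, the foreground estimate, and the indices of the rows flagged as foreground. Selecting whole rows, rather than individual entries, is what encodes the second observation: points on one trajectory are labeled together.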
