Inferring Ongoing Human Activities Based on Recurrent Self-Organizing Map Trajectory

Automatically inferring ongoing activities enables the early recognition of unfinished activities, which is valuable for applications such as online human-machine interaction and security monitoring. State-of-the-art methods use spatio-temporal interest point (STIP) based features as the low-level video description to handle complex scenes [1, 2, 3]. However, the typical bag-of-visual-words (BoVW) representation focuses on the feature distribution and ignores the inherent contexts in sequences, which results in low discrimination when only limited observations are available. To solve this problem, the Recurrent Self-Organizing Map (RSOM) [4], which was designed to process sequential data, is newly adopted in this paper for the high-level representation of ongoing activities. The innovation is that observed features and their spatio-temporal contexts are encoded in a trajectory of pre-trained RSOM units. Additionally, a combination of the Dynamic Time Warping (DTW) distance and the edit distance, named DTW-E, is proposed to measure the structural dissimilarity between RSOM trajectories.

RSOM Trajectory: Since the RSOM is a direct extension of the SOM, we start from the SOM. The SOM maps data from an input space $V_I$ onto a lower-dimensional space $V_L$ (a map) in such a way that the topological relationships in $V_I$ are preserved and the SOM units closely approximate the probability density function of $V_I$. Suppose each unit $i$ in the SOM is associated with a weight vector $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}] \in \mathbb{R}^n$ with the same dimension as the input vector $x = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^n$. The learning process that leads to self-organization on a map can be summarized as follows. (i) The feature vector $x(t)$ is presented as input, and its best matching unit (bmu) on the map is found by computing the minimum distance as:
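Assuming the standard SOM formulation with a Euclidean norm, the best matching unit in step (i) is the unit whose weight vector lies closest to the input, $b(t) = \arg\min_i \lVert x(t) - w_i(t) \rVert$. The following is a minimal Python sketch of that search under this assumption. The function names `best_matching_unit` and `unit_trajectory` are hypothetical, and the sketch covers the plain SOM search only; it does not include the recurrent term of the RSOM.

```python
import math
from typing import List, Sequence


def best_matching_unit(x: Sequence[float],
                       weights: Sequence[Sequence[float]]) -> int:
    """Return the index of the unit whose weight vector is closest to x.

    Assumption: closeness is measured with the Euclidean distance.
    """
    if not weights:
        raise ValueError("the map must contain at least one unit")
    best_index = 0
    best_dist = math.inf
    for i, w in enumerate(weights):
        if len(w) != len(x):
            raise ValueError("weight and input vectors must share a dimension")
        dist = math.dist(x, w)  # Euclidean distance ||x - w_i||
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def unit_trajectory(features: Sequence[Sequence[float]],
                    weights: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector in a sequence to the index of its bmu.

    Simplification: every input is matched independently, so no temporal
    context is carried from one step to the next.
    """
    return [best_matching_unit(x, weights) for x in features]


if __name__ == "__main__":
    # Three units with two-dimensional weight vectors (illustrative values).
    som_weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    sequence = [[0.1, 0.0], [1.9, 0.2], [1.0, 0.9]]
    print(unit_trajectory(sequence, som_weights))
```

Mapping each feature vector of a sequence to its bmu index yields a sequence of unit indices, which serves here as a simplified stand-in for the trajectory of units described above.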

[1] Eamonn J. Keogh, et al. Scaling up dynamic time warping for datamining applications, 2000, KDD '00.

[2] Teuvo Kohonen, et al. Self-Organizing Maps, 2010.

[3] Juan Carlos Niebles, et al. Spatial-Temporal correlatons for unsupervised action classification, 2008, 2008 IEEE Workshop on Motion and video Computing.

[4] Hong Liu, et al. Learning spatio-temporal co-occurrence correlograms for efficient human action classification, 2013, 2013 IEEE International Conference on Image Processing.

[5] Serge J. Belongie, et al. Behavior recognition via sparse spatio-temporal features, 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[6] Mubarak Shah, et al. A 3-dimensional sift descriptor and its application to action recognition, 2007, ACM Multimedia.

[7] Robert Tibshirani, et al. Estimating the number of clusters in a data set via the gap statistic, 2000.

[8] Cordelia Schmid, et al. Evaluation of Local Spatio-temporal Features for Action Recognition, 2009, BMVC.

[9] Cordelia Schmid, et al. A Spatio-Temporal Descriptor Based on 3D-Gradients, 2008, BMVC.

[10] Jukka Heikkonen, et al. A Recurrent Self-Organizing Map for Temporal Sequence Processing, 1997, ICANN.

[11] Ling Shao, et al. Feature detector and descriptor evaluation in human action recognition, 2010, CIVR '10.

[12] Dimitrios Gunopulos, et al. Discovering similar multidimensional trajectories, 2002, Proceedings 18th International Conference on Data Engineering.

[13] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[14] L Leinonen, et al. Dysphonia detected by pattern recognition of spectral composition, 1992, Journal of speech and hearing research.

[15] Barbara Caputo, et al. Recognizing human actions: a local SVM approach, 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.

[16] Jake K. Aggarwal, et al. An Overview of Contest on Semantic Description of Human Activities (SDHA) 2010, 2010, ICPR Contests.

[17] R. Vidal, et al. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18] Ronen Basri, et al. Actions as Space-Time Shapes, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19] M. V. Velzen, et al. Self-organizing maps, 2007.

[20] Dan Schonfeld, et al. Segmented trajectory based indexing and retrieval of video data, 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[21] Michael S. Ryoo, et al. Human activity prediction: Early recognition of ongoing activities from streaming videos, 2011, 2011 International Conference on Computer Vision.

[22] Tieniu Tan, et al. Comparison of Similarity Measures for Trajectory Clustering in Outdoor Surveillance Scenes, 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[23] Christopher Joseph Pal, et al. Activity recognition using the velocity histories of tracked keypoints, 2009, 2009 IEEE 12th International Conference on Computer Vision.

[24] Hichem Sahbi, et al. Mid-level features and spatio-temporal context for activity recognition, 2012, Pattern Recognit.

[25] Hong Liu, et al. Action Disambiguation Analysis Using Normalized Google-Like Distance Correlogram, 2012, ACCV.

[26] Fernando De la Torre, et al. Max-margin early event detectors, 2012, CVPR.

[27] Ramakant Nevatia, et al. Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.