One video is sufficient? Human activity recognition using active video composition

In this paper, we present a novel human activity recognition approach that requires only a single video example per activity. We introduce the paradigm of active video composition, which enables one-example recognition of complex activities. The idea is to automatically create a large number of semi-artificial training videos, called composed videos, by manipulating an original human activity video. We describe a methodology that automatically composes activity videos with different backgrounds, translations, scales, actors, and movement structures. Furthermore, we design an active learning algorithm that models the temporal structure of each activity and prevents the generation of composed training videos that violate the activity's structural constraints. The intention is to generate composed videos with a correct temporal organization and to use them for training the recognition system. In contrast to previous passive recognition systems that rely only on the given training videos, our methodology actively composes the training videos it needs, reflecting the variations the system is expected to observe in its environment. Experimental results illustrate that, by utilizing composed training videos, a single fully labeled video per activity is sufficient for our methodology to reliably recognize human activities.
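
As a rough illustration of the composition idea, the following minimal Python/OpenCV sketch pastes a pre-segmented actor patch onto a new background with a randomly chosen translation and scale. It is a hedged illustration under our own assumptions, not the system described in the paper: the function compose_frame and the toy inputs are hypothetical, and a real system would first segment the actor from the single example video (e.g., via background subtraction or an interactive cutout) and repeat the operation frame by frame, under the learned structural constraints, to produce entire composed videos.

    # Illustrative sketch only: composing one semi-artificial training frame by
    # pasting a segmented actor onto a different background with a random
    # translation and scale. Assumes the actor crop and its binary mask were
    # extracted beforehand from the original example video.
    import numpy as np
    import cv2

    def compose_frame(actor_rgb, actor_mask, background, scale, offset):
        # Resize the actor crop and its mask by the chosen scale factor.
        h, w = actor_rgb.shape[:2]
        new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
        actor = cv2.resize(actor_rgb, new_size, interpolation=cv2.INTER_LINEAR)
        mask = cv2.resize(actor_mask, new_size, interpolation=cv2.INTER_NEAREST)

        out = background.copy()
        x, y = offset
        ah, aw = actor.shape[:2]
        bh, bw = out.shape[:2]
        # Clip the paste region to the background boundaries.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(bw, x + aw), min(bh, y + ah)
        if x0 >= x1 or y0 >= y1:
            return out  # actor fell entirely outside the frame
        region = mask[y0 - y:y1 - y, x0 - x:x1 - x] > 0
        out[y0:y1, x0:x1][region] = actor[y0 - y:y1 - y, x0 - x:x1 - x][region]
        return out

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Toy stand-ins for a real actor crop, its mask, and a new background.
        actor = np.full((120, 60, 3), 200, dtype=np.uint8)
        mask = np.full((120, 60), 255, dtype=np.uint8)
        background = np.zeros((240, 320, 3), dtype=np.uint8)

        scale = float(rng.uniform(0.7, 1.3))
        offset = (int(rng.integers(0, 260)), int(rng.integers(0, 180)))
        composed = compose_frame(actor, mask, background, scale, offset)
        print(composed.shape)  # (240, 320, 3)

Sampling many such scale/translation/background combinations (and, analogously, swapping actors or reordering sub-events within the learned structural constraints) is what turns one labeled example into a large composed training set.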
