will be inserted by the editor ) Continuous Action Recognition Based on Sequence Alignment

Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be simultaneously carried out. We build on the well known dynamic time warping (DTW) framework and devise a novel visual alignment technique, namely dynamic frame warping (DFW), which performs isolated recognition based on per-frame representation of videos, and on aligning a test sequence with a model sequence. Moreover, we propose two extensions which enable to perform recognition concomitant with segmentation, namely one-pass DFW and two-pass DFW. These two methods have their roots in the domain of continuous recognition of speech and, to the best of our knowledge, their extension to continuous visual action recognition has been overlooked. We test and illustrate the proposed techniques with a recently The authors acknowledge support from the European project HUMAVIPS #247525 (2010-2013) and from the ERC Advanced Grant VHIA #340113 (2014-2019). J. Cech acknowledges support from the Czech Science Foundation Project GACR P103/12/G084. Kaustubh Kulkarni INRIA Grenoble Rhône-Alpes Montbonnot Saint-Martin, FRANCE E-mail: Kaustubh.Kulkarni@inria.fr Georgios Evangelidis INRIA Grenoble Rhône-Alpes Montbonnot Saint-Martin, FRANCE E-mail: Georgios.Evangelidis@inria.fr Jan Cech Center for Machine Perception Czech Technical University in Prague, Czech Republic E-mail: cechj@cmp.felk.cvut.cz Radu Horaud INRIA Grenoble Rhône-Alpes Montbonnot Saint-Martin, FRANCE E-mail: Radu.Horaud@inria.fr released dataset (RAVEL) and with two public-domain datasets widely used in action recognition (Hollywood1 and Hollywood-2). We also compare the performances of the proposed isolated and continuous recognition algorithms with several recently published methods.

[1]  Yihong Gong,et al.  Latent Pose Estimator for Continuous Action Recognition , 2008, ECCV.

[2]  Dimitris N. Metaxas,et al.  A Framework for Recognizing the Simultaneous Aspects of American Sign Language , 2001, Comput. Vis. Image Underst..

[3]  Trevor Darrell,et al.  Latent-Dynamic Discriminative Models for Continuous Gesture Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Mubarak Shah,et al.  Classifying web videos using a global video descriptor , 2013, Machine Vision and Applications.

[5]  Ming Ouhyoung,et al.  A real-time continuous gesture recognition system for sign language , 1998, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition.

[6]  T. K. Vintsyuk Element-wise recognition of continuous speech composed of words from a specified dictionary , 1971, CYBERNETICS.

[7]  Georgios D. Evangelidis,et al.  Parametric Image Alignment Using Enhanced Correlation Coefficient Maximization , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Eraldo Ribeiro,et al.  Human Motion Recognition Using Isomap and Dynamic Time Warping , 2007, Workshop on Human Motion.

[9]  Andrew Zisserman,et al.  Efficient Visual Search of Videos Cast as Text Retrieval , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Mark J. F. Gales,et al.  The Application of Hidden Markov Models in Speech Recognition , 2007, Found. Trends Signal Process..

[11]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[12]  Steve Young,et al.  Token passing: a simple conceptual model for connected speech recognition systems , 1989 .

[13]  Chin-Hui Lee,et al.  A frame-synchronous network search algorithm for connected word recognition , 1989, IEEE Trans. Acoust. Speech Signal Process..

[14]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[15]  Fernando De la Torre,et al.  Canonical Time Warping for Alignment of Human Behavior , 2009, NIPS.

[16]  Cristian Sminchisescu,et al.  Conditional models for contextual human motion recognition , 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[17]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[18]  Albert Wang,et al.  The In-Crowd Algorithm for Fast Basis Pursuit Denoising , 2011, IEEE Transactions on Signal Processing.

[19]  Meinard Müller,et al.  Information retrieval for music and motion , 2007 .

[20]  Gérard G. Medioni,et al.  Dynamic Manifold Warping for view invariant action recognition , 2011, 2011 International Conference on Computer Vision.

[21]  Hermann Ney,et al.  Dynamic programming search for continuous speech recognition , 1999, IEEE Signal Process. Mag..

[22]  V. Ramasubramanian,et al.  A Framework for Indexing Human Actions in Video , 2008 .

[23]  Hermann Ney,et al.  The use of a one-stage dynamic programming algorithm for connected word recognition , 1984 .

[24]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Ivan Laptev,et al.  Improving bag-of-features action recognition with non-local cues , 2010, BMVC.

[26]  Joel A. Tropp,et al.  Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit , 2007, IEEE Transactions on Information Theory.

[27]  Radu Horaud,et al.  RAVEL: an annotated corpus for training robots with audiovisual abilities , 2013, Journal on Multimodal User Interfaces.

[28]  Ramakant Nevatia,et al.  Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Radu Horaud,et al.  Action Recognition Robust to Background Clutter by Using Stereo Vision , 2012, ECCV Workshops.

[30]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[31]  MetaxasDimitris,et al.  Conditional models for contextual human motion recognition , 2006 .

[32]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[33]  Patrick Bouthemy,et al.  Better Exploiting Motion for Better Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Stan Sclaroff,et al.  A Unified Framework for Gesture Recognition and Spatiotemporal Gesture Segmentation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Manuela M. Veloso,et al.  Conditional random fields for activity recognition , 2007, AAMAS '07.

[36]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[37]  Chong-Wah Ngo,et al.  Trajectory-Based Modeling of Human Actions with Motion Reference Points , 2012, ECCV.

[38]  Dimitris N. Metaxas,et al.  ASL recognition based on a coupling between HMMs and 3D motion analysis , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[39]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[40]  C. Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Alex Pentland,et al.  Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[42]  H. Sakoe,et al.  Two-level DP-matching--A dynamic programming-based pattern matching algorithm for connected word recognition , 1979 .

[43]  Ramakant Nevatia,et al.  Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost , 2006, ECCV.

[44]  William Brendel,et al.  Activities as Time Series of Human Postures , 2010, ECCV.