Learning the abstract motion semantics of verbs from captioned videos

We propose an algorithm for learning the semantics of a (motion) verb from videos depicting the action expressed by the verb, paired with sentences describing the action participants and their roles. Acknowledging that commonalities among example videos may not exist at the level of the input features, our approximation algorithm efficiently searches the space of more abstract features for a common solution. We test our algorithm by using it to learn the semantics of a sample set of verbs; results demonstrate the usefulness of the proposed framework, while identifying directions for further improvement.

[1]  Sven J. Dickinson,et al.  Generic model abstraction from examples , 2000, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Rémi Ronfard,et al.  Free viewpoint action recognition using motion history volumes , 2006, Comput. Vis. Image Underst..

[3]  Yiannis Aloimonos,et al.  A Language for Human Action , 2007, Computer.

[4]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Robert Givan,et al.  Specific-to-General Learning for Temporal Events with Application to Learning Event Definitions from Video , 2002, J. Artif. Intell. Res..

[6]  Aaron F. Bobick,et al.  Recognition of Visual Activities and Interactions by Stochastic Parsing , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  John F. Canny,et al.  A Computational Approach to Edge Detection , 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Jeffrey Mark Siskind,et al.  A Maximum-Likelihood Approach to Visual Event Classification , 1996, ECCV.

[9]  Mubarak Shah,et al.  A differential geometric approach to representing the human actions , 2008, Comput. Vis. Image Underst..

[10]  Alex Pentland,et al.  Space-time gestures , 1993, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Junji Yamato,et al.  Recognizing human action in time-sequential images using hidden Markov model , 1992, Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Anthony Hoogs,et al.  Video content annotation using visual analysis and a large semantic knowledgebase , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..