Automatic extraction of relevant video shots of specific actions exploiting Web data

Video sharing websites have recently become a tremendous source of video that is easily accessible at no cost. This has encouraged researchers in the action recognition field to construct action databases from Web sources. However, Web sources are generally too noisy to be used directly as a recognition database, so building action databases from them has required extensive human effort to manually select the video parts related to the specified actions. In this paper, we introduce a novel method that automatically extracts video shots related to given action keywords from Web videos, based on their metadata and visual features. First, we select relevant videos among tagged Web videos based on the relevance between their tags and the given keyword. After segmenting the selected videos into shots, we rank these shots by their visual features so that the shots of interest appear at the top of the ranking. In particular, we propose adopting Web images and a human pose matching method in the shot ranking step, and we show that this helps boost more relevant shots to the top. Our unsupervised method requires only that action keywords such as "surf wave" or "bake bread" be provided at the beginning. We conducted large-scale experiments on various kinds of human actions as well as non-human actions and obtained promising results.
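
The two-stage flow described above (tag-based video selection, then visual ranking of shots) can be illustrated with a minimal sketch. This is not the paper's implementation: the relevance measure (fraction of keyword terms found among the tags), the 0.5 threshold, the cosine-similarity ranking against Web image features, and all function names and toy data are assumptions made only for illustration. Shot segmentation and pose matching are omitted.

```python
from math import sqrt

def tag_relevance(keyword, tags):
    """Assumed relevance measure: fraction of keyword terms found among the tags."""
    terms = set(keyword.lower().split())
    tagset = {t.lower() for t in tags}
    return len(terms & tagset) / len(terms) if terms else 0.0

def select_videos(keyword, videos, threshold=0.5):
    """Keep the ids of videos whose tag relevance reaches the (assumed) threshold."""
    return [vid for vid, tags in videos.items()
            if tag_relevance(keyword, tags) >= threshold]

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_shots(shot_features, reference_features):
    """Sort shot ids by mean cosine similarity to reference (e.g. Web image) features."""
    def score(feat):
        return sum(cosine(feat, r) for r in reference_features) / len(reference_features)
    return sorted(shot_features, key=lambda s: score(shot_features[s]), reverse=True)

# Hypothetical toy data: video ids mapped to their tags.
videos = {"v1": ["surf", "wave", "beach"], "v2": ["cat", "funny"], "v3": ["surf", "shop"]}
selected = select_videos("surf wave", videos)

# Hypothetical 2-D visual features for shots of the selected videos and for Web images.
shots = {"v1_s1": [0.9, 0.1], "v1_s2": [0.1, 0.9], "v3_s1": [0.7, 0.3]}
web_image_features = [[1.0, 0.0], [0.8, 0.2]]
ranking = rank_shots(shots, web_image_features)
```

With this toy data, `selected` keeps `v1` and `v3`, and `ranking` places `v1_s1` first because its features lie closest to those of the Web images. A real system would use much richer visual descriptors and a more robust ranking scheme.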
