Localizing web videos using social images

While inferring the geo-locations of web images has been widely studied, there is limited work engaging in geo-location inference of web videos due to inadequate labeled samples available for training. However, such a geographical localization functionality is of great importance to help existing video sharing websites provide location-aware services, such as location-based video browsing, video geo-tag recommendation, and location sensitive video search on mobile devices. In this paper, we address the problem of localizing web videos through transferring large-scale web images with geographic tags to web videos, where near-duplicate detection between images and video frames is conducted to link the visually relevant web images and videos. To perform our approach, we choose the trustworthy web images by evaluating the consistency between the visual features and associated metadata of the collected images, therefore eliminating the noisy images. In doing so, a novel transfer learning algorithm is proposed to align the landmark prototypes across both domains of images and video frames, leading to a reliable prediction of the geo-locations of web videos. A group of experiments are carried out on two datasets which collect Flickr images and YouTube videos crawled from the Web. The experimental results demonstrate the effectiveness of our video geo-location inference approach which outperforms several competing approaches using the traditional frame-level video geo-location inference.

[1]  Yue Gao,et al.  Camera Constraint-Free View-Based 3-D Object Retrieval , 2012, IEEE Transactions on Image Processing.

[2]  Wei-Ying Ma,et al.  AnnoSearch: Image Auto-Annotation by Search , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[3]  Ming Yang,et al.  Query Specific Fusion for Image Retrieval , 2012, ECCV.

[4]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[5]  Yue Gao,et al.  3-D Object Retrieval and Recognition With Hypergraph Analysis , 2012, IEEE Transactions on Image Processing.

[6]  Steven M. Seitz,et al.  Finding paths through the world's photos , 2008, SIGGRAPH 2008.

[7]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[8]  Sebastian Nowozin,et al.  On feature combination for multiclass object classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[9]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[10]  Mor Naaman,et al.  How flickr helps us make sense of the world: context and content in community-contributed media collections , 2007, ACM Multimedia.

[11]  Wen Gao,et al.  Location Discriminative Vocabulary Coding for Mobile Landmark Search , 2011, International Journal of Computer Vision.

[12]  Meng Wang,et al.  Unified Video Annotation via Multigraph Learning , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[13]  Jon M. Kleinberg,et al.  Mapping the world's photos , 2009, WWW '09.

[14]  Xin Chen,et al.  City-scale landmark identification on mobile devices , 2011, CVPR 2011.

[15]  Qi Tian,et al.  Learning heterogeneous data for hierarchical web video classification , 2011, MM '11.

[16]  Xuelong Li,et al.  Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search , 2013, IEEE Transactions on Image Processing.

[17]  Mohammad Soleymani,et al.  Automatic tagging and geotagging in video collections and communities , 2011, ICMR.

[18]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[19]  Ivor W. Tsang,et al.  Visual Event Recognition in Videos by Learning from Web Data , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Ivor W. Tsang,et al.  Domain Transfer SVM for video concept detection , 2009, CVPR 2009.

[21]  Meng Wang,et al.  Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification , 2012, IEEE Transactions on Multimedia.

[22]  Chong-Wah Ngo,et al.  Semantic context transfer across heterogeneous sources for domain adaptive video search , 2009, ACM Multimedia.

[23]  Xing Xie,et al.  Mining city landmarks from blogs by graph modeling , 2009, ACM Multimedia.

[24]  Xian-Sheng Hua,et al.  Towards a Relevant and Diverse Search of Social Images , 2010, IEEE Transactions on Multimedia.

[25]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[26]  Bingbing Ni,et al.  Assistive tagging: A survey of multimedia tagging with human-computer joint exploration , 2012, CSUR.

[27]  Thomas Sikora,et al.  Multi-modal, multi-resource methods for placing Flickr videos on the map , 2011, ICMR.

[28]  Yang Song,et al.  Tour the world: Building a web-scale landmark recognition engine , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  B. C. Brookes,et al.  Information Sciences , 2020, Cognitive Skills You Need for the 21st Century.

[30]  Shumeet Baluja,et al.  VisualRank: Applying PageRank to Large-Scale Image Search , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Richard Szeliski,et al.  Finding paths through the world's photos , 2008, ACM Trans. Graph..

[32]  Mor Naaman,et al.  Generating diverse and representative image search results for landmarks , 2008, WWW.

[33]  Alexei A. Efros,et al.  IM2GPS: estimating geographic information from a single image , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Rong Yan,et al.  Cross-domain video concept detection using adaptive svms , 2007, ACM Multimedia.

[35]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[36]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.