Active key frame selection for 3D model reconstruction from crowdsourced geo-tagged videos

Automatic reconstruction of 3D models is attracting increasing attention in the multimedia community. Recovering a scene from video sequences requires selecting a set of representative video frames. Most prior work adopted content-based techniques to automate key frame extraction. However, these methods do not take the geo-information of frames into account and remain compute-intensive. Here we propose a new approach to key frame selection based on the geographic properties of videos. An increasing number of user-generated videos (UGVs) are being collected, a trend driven by the ubiquitous availability of smartphones. Additionally, it has become easy to continuously acquire various sensor data (e.g., geo-spatial metadata) and fuse them with video to create geo-tagged mobile videos. Our technique uses this underlying geo-metadata to select the most representative frames. Specifically, a key frame subset with minimal spatial coverage gain difference is extracted by incorporating a manifold structure into a reproducing kernel Hilbert space to analyze the spatial relationships among the frames. Our experimental results show that the execution time of 3D reconstruction is shortened while the model quality is preserved.
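To make the idea of selecting frames from geo-metadata alone more concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes each frame carries a GPS position and a compass heading, builds an RBF kernel over those values, warps it with a k-nearest-neighbor graph Laplacian (a standard data-dependent kernel construction), and then picks frames greedily in the style of sequential transductive experimental design. All function names, the feature layout, and the parameters `sigma`, `k`, `lam`, `mu`, and `heading_weight_m` are hypothetical choices made for illustration; the spatial coverage gain criterion mentioned above is not modeled here.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0


def frame_features(lat_deg, lon_deg, heading_deg, heading_weight_m=20.0):
    """Map per-frame geo-metadata to a small feature vector (assumed layout).

    Positions are projected to local east/north meters with an equirectangular
    approximation; the compass heading is encoded as a scaled unit vector.
    """
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    theta = np.radians(np.asarray(heading_deg, dtype=float))
    east = EARTH_RADIUS_M * (lon - lon.mean()) * np.cos(lat.mean())
    north = EARTH_RADIUS_M * (lat - lat.mean())
    return np.column_stack([
        east,
        north,
        heading_weight_m * np.sin(theta),
        heading_weight_m * np.cos(theta),
    ])


def rbf_kernel(X, sigma):
    """Return the RBF Gram matrix and the squared pairwise distances."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2)), d2


def knn_laplacian(d2, k):
    """Unnormalized Laplacian of a symmetrized k-nearest-neighbor graph."""
    n = d2.shape[0]
    d = d2.copy()
    np.fill_diagonal(d, np.inf)  # a frame is not its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)
    return np.diag(W.sum(axis=1)) - W


def manifold_kernel(K, L, lam):
    """Data-dependent kernel: K - K (I + lam*L*K)^{-1} (lam*L) K."""
    n = K.shape[0]
    M = lam * L
    KM = K - K @ np.linalg.solve(np.eye(n) + M @ K, M @ K)
    return 0.5 * (KM + KM.T)  # remove round-off asymmetry


def select_key_frames(K, m, mu):
    """Greedy selection: pick the frame that best explains the rest, deflate."""
    K = K.copy()
    available = np.ones(K.shape[0], dtype=bool)
    chosen = []
    for _ in range(m):
        scores = np.sum(K ** 2, axis=0) / (np.diag(K) + mu)
        scores[~available] = -np.inf
        i = int(np.argmax(scores))
        chosen.append(i)
        available[i] = False
        col = K[:, i].copy()
        denom = K[i, i] + mu
        K -= np.outer(col, col) / denom
    return chosen
```

A typical call chain under these assumptions would be `X = frame_features(lat, lon, heading)`, then `K, d2 = rbf_kernel(X, sigma)`, `L = knn_laplacian(d2, k)`, `KM = manifold_kernel(K, L, lam)`, and finally `select_key_frames(KM, m, mu)`. Because the selection reads only sensor metadata, its cost does not depend on image resolution, in contrast to content-based extraction.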
