Paper Information - Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization

The emergence of wearable devices such as portable cameras and smart glasses makes it possible to record lifelogging first-person videos. Browsing such long, unstructured videos is time-consuming and tedious. This paper studies how to discover the moments of major or special interest to the user (i.e., highlights) in a video, in order to summarize first-person videos. Specifically, we propose a novel pairwise deep ranking model that employs deep learning techniques to learn the relationship between highlight and non-highlight video segments. For video highlight detection, we develop a two-stream network structure that represents video segments using complementary information: the appearance of video frames and the temporal dynamics across frames. Given a long personal video, the highlight detection model assigns a highlight score to each segment. The resulting highlight segments are used for summarization in two ways: video time-lapse and video skimming. The former plays highlight segments at low speed and non-highlight segments at high speed, while the latter assembles the sequence of segments with the highest scores. On 100 hours of first-person videos covering 15 unique sports categories, our highlight detection improves on the state-of-the-art RankSVM method by 10.5% in accuracy. Moreover, a user study with 35 human subjects shows that our approaches produce video summaries of better quality.
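The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the loss function, the fusion rule for the two streams, or the playback speeds, so the margin-based hinge loss, the weighted average, the score threshold, and the rate values below are all assumptions, and every function name is hypothetical.

```python
from typing import List, Sequence


def pairwise_hinge_loss(score_highlight: float,
                        score_non_highlight: float,
                        margin: float = 1.0) -> float:
    """Pairwise ranking penalty (assumed form): zero once the highlight
    segment outscores the non-highlight segment by at least `margin`."""
    return max(0.0, margin - (score_highlight - score_non_highlight))


def fuse_streams(appearance_score: float,
                 temporal_score: float,
                 weight: float = 0.5) -> float:
    """Combine the two stream scores with a weighted average
    (an assumed fusion rule; the abstract does not specify one)."""
    return weight * appearance_score + (1.0 - weight) * temporal_score


def skim(scores: Sequence[float], k: int) -> List[int]:
    """Video skimming: pick the k highest-scoring segments and
    return their indices in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def time_lapse_rates(scores: Sequence[float],
                     threshold: float,
                     slow: float = 1.0,
                     fast: float = 4.0) -> List[float]:
    """Video time-lapse: play segments scoring at or above `threshold`
    at the slow rate and all other segments at the fast rate."""
    return [slow if s >= threshold else fast for s in scores]


if __name__ == "__main__":
    # Hypothetical per-segment highlight scores for a four-segment video.
    segment_scores = [0.1, 0.9, 0.4, 0.8]
    print(skim(segment_scores, k=2))
    print(time_lapse_rates(segment_scores, threshold=0.5))
```

During training, a loss of this kind would be summed over (highlight, non-highlight) segment pairs drawn from the same video, so the model only needs relative judgments rather than absolute highlight labels.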
