Extracting key frames from first-person videos in the common space of multiple sensors

Selecting authentic scenes of activities of daily living (ADL) helps support our memory of everyday life. Key-frame extraction for first-person vision (FPV) videos is a core technology for realizing such a memory assistant. However, most existing key-frame extraction methods focus on stable scenes unrelated to ADL and use only the visual signals of the image sequence, even though activities are usually associated with our visual experience. To handle the dynamically changing scenes of FPV videos of daily activities, integrating motion and visual signals is essential. In this paper, we present a novel key-frame extraction method for ADL that integrates multi-modal sensor signals to reduce noise and detect salient activities. The proposed method projects motion and visual features into a shared space using probabilistic canonical correlation analysis and selects key frames in that space. Experimental results on ADL datasets collected in a house suggest that running key-frame extraction in the shared space improves both the precision of the extracted key frames and the coverage of the entire video.
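The shared-space idea can be illustrated with a minimal sketch. It is not the authors' implementation: it substitutes classical regularized CCA computed with NumPy for the probabilistic variant, and it uses a simple placeholder selection rule (one frame per equal-length temporal segment, the frame nearest the segment mean). The function names `cca_project` and `select_key_frames`, the synthetic `motion` and `visual` features, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def cca_project(X, Y, dim, reg=1e-6):
    """Project two views into a shared space with classical regularized CCA.

    Assumption: classical CCA stands in for the probabilistic CCA
    named in the abstract.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each view with its Cholesky factor (Sxx = Lx Lx^T).
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :dim])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :dim])
    return Xc @ Wx, Yc @ Wy, s[:dim]


def select_key_frames(Z, k):
    """Placeholder rule: per temporal segment, pick the frame nearest its mean."""
    bounds = np.linspace(0, len(Z), k + 1).astype(int)
    keys = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = Z[start:end]
        dists = np.linalg.norm(seg - seg.mean(axis=0), axis=1)
        keys.append(start + int(np.argmin(dists)))
    return keys


# Synthetic stand-ins for per-frame motion and visual features that
# share a 2-D latent signal (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n_frames = 300
latent = rng.normal(size=(n_frames, 2))
motion = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(n_frames, 6))
visual = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(n_frames, 10))

Zm, Zv, corr = cca_project(motion, visual, dim=2)
shared = 0.5 * (Zm + Zv)  # average the two projections per frame
keys = select_key_frames(shared, k=5)
print("canonical correlations:", corr)
print("key-frame indices:", keys)
```

Averaging the two projections is one simple way to obtain a single shared representation per frame; a probabilistic formulation would instead infer a posterior over the latent variable given both views.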
