Paper information - Lifelog Scene Change Detection Using Cascades of Audio and Video Detectors

The advent of affordable wearable devices with a video camera has established a new form of social data, lifelogs, in which people's lives are captured on video. The enormous amount of lifelog data and the need for on-site processing demand new, fast video processing methods. In this work, we experimentally investigate seven hours of lifelogs and point out two novel findings: (1) audio cues are exceptionally strong for lifelog processing; (2) cascades of audio and video detectors improve accuracy and enable fast (super frame rate) processing. We first construct strong detectors using state-of-the-art audio and visual features: Mel-frequency cepstral coefficients (MFCC), colour (RGB) histograms, and local patch descriptors (SIFT). In the second stage, we construct a cascade of the trained detectors and optimise the cascade parameters. Separating the detector and cascade optimisation stages simplifies training and results in a fast and accurate processing pipeline.
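The cascade idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Stage` class, the `run_cascade` function, the stage order (MFCC, then RGB, then SIFT), the reject/accept thresholds, and the toy score functions are all hypothetical, chosen only to show how an early, cheap detector can decide clear cases so that later, costlier detectors run only on ambiguous segments.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Stage:
    """One cascade stage (hypothetical structure, not from the paper)."""
    name: str
    score: Callable[[dict], float]  # assumed detector: segment -> score in [0, 1]
    reject_below: float             # score under this: decide "no scene change"
    accept_above: float             # score over this: decide "scene change"


def run_cascade(stages: Sequence[Stage], segment: dict) -> Tuple[bool, str]:
    """Return (is_scene_change, name of the stage that made the decision)."""
    # Early stages decide only the clear cases and pass the rest onwards.
    for stage in stages[:-1]:
        s = stage.score(segment)
        if s < stage.reject_below:
            return False, stage.name
        if s > stage.accept_above:
            return True, stage.name
    # The last stage must always decide; it uses accept_above as a single threshold.
    last = stages[-1]
    return last.score(segment) >= last.accept_above, last.name


# Toy detectors that read precomputed scores; real detectors would compute
# MFCC, RGB-histogram, or SIFT features here. Thresholds are illustrative.
stages = [
    Stage("mfcc", lambda seg: seg["audio"], reject_below=0.2, accept_above=0.8),
    Stage("rgb", lambda seg: seg["rgb"], reject_below=0.3, accept_above=0.7),
    Stage("sift", lambda seg: seg["sift"], reject_below=0.5, accept_above=0.5),
]

if __name__ == "__main__":
    segment = {"audio": 0.5, "rgb": 0.5, "sift": 0.6}
    print(run_cascade(stages, segment))
```

In this sketch the thresholds play the role of the cascade parameters: they can be tuned after the individual detectors are trained, which mirrors the separation of detector training and cascade optimisation described in the abstract.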
