Lifelog Scene Change Detection Using Cascades of Audio and Video Detectors

The advent of affordable wearable devices with a video camera has established the new form of social data, lifelogs, where lives of people are captured to video. Enormous amount of lifelog data and need for on-site processing demand new fast video processing methods. In this work, we experimentally investigate seven hours of lifelogs and point out novel findings: (1) audio cues are exceptionally strong for lifelog processing; (2) cascades of audio and video detectors improve accuracy and enable fast (super frame rate) processing speed. We first construct strong detectors using state-of-the-art audio and visual features: Mel-frequency cepstral coefficients (MFCC), colour (RGB) histograms, and local patch descriptors (SIFT). In the second stage, we construct a cascade of the trained detectors and optimise cascade parameters. Separating the detector and cascade optimisation stages simplify training and results to a fast and accurate processing pipeline.

[1]  Kilian Q. Weinberger,et al.  Classifier Cascade for Minimizing Feature Evaluation Cost , 2012, AISTATS.

[2]  Anton van den Hengel,et al.  Training Effective Node Classifiers for Cascade Classification , 2013, International Journal of Computer Vision.

[3]  J. Stephen Downie,et al.  Music information retrieval , 2005, Annu. Rev. Inf. Sci. Technol..

[4]  Song-Chun Zhu,et al.  Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection , 2015, 2013 IEEE International Conference on Computer Vision.

[5]  Rainer Lienhart,et al.  Scene Determination Based on Video and Audio Features , 2004, Multimedia Tools and Applications.

[6]  Paul Over,et al.  Video shot boundary detection: Seven years of TRECVid activity , 2010, Comput. Vis. Image Underst..

[7]  Chengcui Zhang,et al.  Scene change detection by audio and video clues , 2002, Proceedings. IEEE International Conference on Multimedia and Expo.

[8]  Ullas Gargi,et al.  Characterization of Video-Shot-Change Detection Methods , 2000 .

[9]  Wolfgang Effelsberg,et al.  Scene Determination Based on Video and Audio Features , 1999, Proceedings IEEE International Conference on Multimedia Computing and Systems.

[10]  Luc Van Gool,et al.  Creating Summaries from User Videos , 2014, ECCV.

[11]  Yang Song,et al.  Taxonomic classification for web-based videos , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Eugenia Leu,et al.  The automatic video editor , 2003, MULTIMEDIA '03.

[13]  Paul Over,et al.  TRECVID: evaluating the effectiveness of information retrieval tasks on digital video , 2004, MULTIMEDIA '04.

[14]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[15]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[16]  Meng Wang,et al.  Movie2Comics: Towards a Lively Video Content Presentation , 2012, IEEE Transactions on Multimedia.

[17]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[18]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[19]  François Pachet,et al.  The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music. , 2007, The Journal of the Acoustical Society of America.

[20]  Hao Jiang,et al.  Video segmentation with the assistance of audio content analysis , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[21]  Annamaria Mesaros,et al.  Sound Event Detection in Multisource Environments Using Source Separation , 2011 .

[22]  László Böszörményi,et al.  State-of-the-art and future challenges in video scene detection: a survey , 2013, Multimedia Systems.

[23]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[24]  Ioannis Pitas,et al.  Enhanced Eigen-Audioframes for Audiovisual Scene Change Detection , 2007, IEEE Transactions on Multimedia.

[25]  Bin Zhao,et al.  Quasi Real-Time Summarization for Consumer Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Christoph H. Lampert,et al.  Unsupervised Object Discovery: A Comparison , 2010, International Journal of Computer Vision.

[27]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.