Predicting audio-visual salient events based on visual, audio and text modalities for movie summarization

In this paper, we present a new and improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio, and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired front-end based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the resulting saliency volumes. For auditory salient event detection, we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words in the movie subtitles. To evaluate the proposed system, we employ a simple, non-parametric classification technique, k-nearest neighbors (KNN). Detection results are reported on the MovSum database, using both objective evaluations against ground-truth annotations of the perceptually salient events and human evaluations of the movie summaries. Our evaluation verifies the effectiveness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, while also improving the comprehension of the movie's semantics.
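As context for the audio front-end mentioned above, the discrete Teager-Kaiser Energy Operator is commonly defined as Psi[x](n) = x(n)^2 - x(n-1) * x(n+1). The sketch below is illustrative only (the function name and framing are ours, not the paper's) and shows the operator applied to a 1-D signal; for a pure sinusoid of amplitude A and frequency Omega, the output approximates the constant A^2 * sin^2(Omega).

```python
import math


def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser Energy Operator for a 1-D signal.

    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1), defined for interior samples,
    so the output is two samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# For x(n) = sin(pi/2 * n), i.e. A = 1 and Omega = pi/2, the operator
# yields a constant energy of A^2 * sin^2(Omega) = 1 at every sample.
signal = [math.sin(math.pi / 2 * n) for n in range(8)]
energy = teager_kaiser_energy(signal)
```

In practice, such instantaneous energies are typically computed per filterbank channel and then aggregated per frame to form audio saliency features.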
