Watching, Thinking, Reacting: A Human-Centered Framework for Movie Content Analysis

In this paper, we propose a human-centered framework, "Watching, Thinking, Reacting," for movie content analysis. The framework is organized as a three-level hierarchy. The low level models human perception of external stimuli: a human attention model based on the Weber-Fechner law is constructed to extract movie highlights. The middle level simulates human cognition of semantics: semantic descriptors are modeled for automatic semantic annotation. The high level imitates human actions driven by perception and cognition: an integrated graph combining content and contextual information is proposed for correlating and recommending movie highlights. Moreover, three recommendation strategies are presented. Promising results from subjective and objective evaluations indicate that the proposed framework not only enables computers to understand movie content intelligently, but also provides personalized movie-highlight recommendation that effectively leads audiences to preview new movies in an individualized manner.
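The abstract does not give the attention model's exact form. As a rough illustrative sketch only (all function names, parameters, and the mapping from raw frame features to "stimulus intensity" are hypothetical assumptions, not the paper's method), the Weber-Fechner law states that perceived magnitude grows logarithmically with stimulus intensity, S = k · ln(I / I₀), which could be used to turn raw per-frame feature energy into perceptual attention scores:

```python
import math

def weber_fechner_response(stimulus, threshold=1.0, k=1.0):
    """Perceived magnitude S = k * ln(I / I0) per the Weber-Fechner law.

    Stimuli at or below the detection threshold I0 evoke no response.
    """
    if stimulus <= threshold:
        return 0.0
    return k * math.log(stimulus / threshold)

def highlight_scores(frame_intensities, threshold=1.0, k=1.0):
    """Map raw per-frame stimulus values (e.g. motion or audio energy)
    to attention scores; higher scores mark candidate highlight frames."""
    return [weber_fechner_response(i, threshold, k) for i in frame_intensities]

# Example: a quiet frame, two moderate frames, and one intense frame.
scores = highlight_scores([0.5, 2.0, 8.0, 2.0])
```

Note the compressive effect: quadrupling the stimulus from 2.0 to 8.0 only doubles the perceived response, which is the behavior that motivates using a logarithmic law rather than raw feature energy when ranking highlights.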
