Affective Audio-Visual Words and Latent Topic Driving Model for Realizing Movie Affective Scene Classification

This paper presents a novel method for movie affective scene classification that outputs the emotion (in the form of labels) that a scene is likely to arouse in viewers. Since users' affective preferences play an important role in movie selection, affective scene classification has the potential to enable more attractive, user-centric movie search and browsing applications. We consider two main issues in designing movie affective scene classification: how to extract features that are strongly related to viewers' emotions, and how to map the extracted features to emotion categories. For the former, we propose a method that extracts emotion-category-specific audio-visual features named affective audio-visual words (AAVWs). For the latter, we propose a classification model named the latent topic driving model (LTDM). Assuming that viewers' emotions change dynamically as the movie scene sequence unfolds, LTDM models emotions as Markovian dynamic systems driven by the sequential stimuli of the movie content. Experiments on 206 movie scenes extracted from 24 movie titles, with labels from eight emotion categories assigned by 16 subjects, show that our method outperforms conventional approaches in terms of the subject agreement rate.
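To make the two-stage design concrete, the sketch below illustrates one plausible reading of the pipeline: each scene segment is quantized into a histogram over a learned audio-visual codebook (a bag-of-words style stand-in for the AAVWs), and a simple Markovian update propagates a distribution over emotion categories across consecutive segments, loosely mirroring the idea that LTDM treats viewer emotion as a dynamic system driven by sequential stimuli. Everything here, including the function and class names, the codebook size, and the four-emotion example, is an illustrative assumption rather than the authors' implementation.

```python
# A minimal, hypothetical sketch of the two-stage idea from the abstract:
# (1) quantize a scene segment's audio-visual features into a word histogram,
# (2) update a distribution over emotion categories with a Markovian step.
# All names (quantize_to_aavws, EmotionDynamics, etc.) and all numbers are
# illustrative assumptions, not the authors' implementation.
import numpy as np


def quantize_to_aavws(frame_features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map per-frame features (n_frames, dim) to a normalized histogram over
    a codebook (n_words, dim), e.g. centroids learned by k-means on
    emotion-category-specific training segments."""
    dists = np.linalg.norm(frame_features[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)           # nearest codeword per frame
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)


class EmotionDynamics:
    """Toy Markovian emotion model: the belief over emotions at segment t
    depends on the belief at t-1 (transition) and on the current segment's
    word histogram scored against per-emotion word weights (emission)."""

    def __init__(self, transition: np.ndarray, emission: np.ndarray):
        self.transition = transition    # (n_emotions, n_emotions), rows sum to 1
        self.emission = emission        # (n_emotions, n_words), word weights per emotion

    def step(self, belief: np.ndarray, aavw_hist: np.ndarray) -> np.ndarray:
        predicted = belief @ self.transition                      # temporal prediction
        log_lik = self.emission @ np.log(aavw_hist + 1e-8)        # crude segment log-likelihood
        posterior = predicted * np.exp(log_lik - log_lik.max())   # stabilized Bayes-style update
        return posterior / posterior.sum()


# Usage: label a scene with the emotion that has the highest final belief.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 20))             # 32 audio-visual words, 20-dim features
emotions = ["joy", "anger", "sadness", "fear"]   # placeholder; the paper uses eight categories
model = EmotionDynamics(
    transition=np.full((4, 4), 0.1) + 0.6 * np.eye(4),
    emission=rng.random((4, 32)),
)
belief = np.full(4, 0.25)                        # uniform prior over emotions
for _ in range(5):                               # five consecutive segments of one scene
    segment_features = rng.normal(size=(100, 20))    # placeholder frame-level features
    belief = model.step(belief, quantize_to_aavws(segment_features, codebook))
print("predicted emotion:", emotions[int(belief.argmax())])
```

The transition matrix encodes the assumption that a viewer's emotion tends to persist between neighboring segments, while the emission scores let the current segment's audio-visual words pull the belief toward a different category; the actual paper's LTDM is topic-based and more elaborate than this toy update.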
