A Case Study of Deep Learning Based Multi-Modal Methods for Predicting the Age-Suitability Rating of Movie Trailers

In this work, we explore different approaches to combining modalities for the problem of automated age-suitability rating of movie trailers. First, we introduce a new dataset of English-language movie trailer videos, downloaded from IMDB and YouTube, together with their corresponding age-suitability rating labels. Second, we propose a multi-modal deep learning pipeline for this task. To our knowledge, this is the first attempt to combine video, audio, and speech information for this problem, and our experimental results show that multi-modal approaches significantly outperform the best unimodal and bimodal models.
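To make the modality-combination idea concrete, here is a minimal late-fusion sketch: each modality encoder produces a fixed-size embedding, the embeddings are concatenated, and a linear classifier maps the fused vector to rating classes. This is an illustrative assumption about one common fusion strategy, not the paper's exact architecture; all names, dimensions, and the label set are hypothetical.

```python
# Hedged sketch of feature-level (late) fusion for age-rating prediction.
# Assumptions: pre-computed per-modality embeddings, concatenation fusion,
# and a single linear classification layer. Names are illustrative only.
import numpy as np

RATING_CLASSES = ["G", "PG", "PG-13", "R"]  # hypothetical label set


def fuse_and_classify(video_emb, audio_emb, text_emb, weights, bias):
    """Concatenate modality embeddings, apply a linear layer, softmax."""
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    logits = weights @ fused + bias
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    probs = exp / exp.sum()
    return RATING_CLASSES[int(np.argmax(probs))], probs


# Toy dimensions: 4-dim video + 3-dim audio + 2-dim text -> 9-dim fused.
rng = np.random.default_rng(0)
W = rng.standard_normal((len(RATING_CLASSES), 9))
b = np.zeros(len(RATING_CLASSES))
label, probs = fuse_and_classify(rng.standard_normal(4),
                                 rng.standard_normal(3),
                                 rng.standard_normal(2), W, b)
assert label in RATING_CLASSES and abs(probs.sum() - 1.0) < 1e-9
```

In practice the paper's pipeline would learn the encoders and fusion weights jointly end to end; richer fusion schemes (e.g. gated or attention-based, as in the multimodal-fusion literature) replace plain concatenation.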
