A First Dataset for Film Age Appropriateness Investigation

Film age appropriateness classification is an important problem with significant societal impact, yet it has so far received little attention from Natural Language Processing and Machine Learning researchers. To address this gap, we have collected a corpus of 17,000 films along with their age ratings. We use the films' textual content in experiments to predict the correct age classification for the United States (G, PG, PG-13, R and NC-17) and the United Kingdom (U, PG, 12A, 15, 18 and R18). Our experiments indicate that gradient boosting machines outperform FastText and various Deep Learning architectures. For the US ratings, we reach an overall accuracy of 79.3%, compared to a projected superhuman accuracy of 84%. For the UK ratings, we reach an overall accuracy of 65.3%, compared to a projected superhuman accuracy of 80.0%.
