Finding the Storyteller: Automatic Spoiler Tagging using Linguistic Cues

Given a movie comment, does it contain a spoiler? A spoiler is a comment that, when disclosed, would ruin a surprise or reveal an important plot detail. We study automatic methods for detecting comments and reviews that contain spoilers and apply them to reviews from the IMDB (Internet Movie Database) website. We develop topic models based on Latent Dirichlet Allocation (LDA) that use linguistic dependency information in place of simple bag-of-words (BOW) features. Experimental results on four movie-comment datasets of different scales demonstrate the effectiveness of our technique.
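
To make the contrast between the two representations concrete, the following minimal sketch builds both feature sets for a single comment. It is an illustration only and not the authors' implementation: the example sentence, the hand-written dependency triples, the `head->dependent` feature format, and the function names `bow_features` and `dependency_features` are all assumptions introduced here. In practice the triples would come from a dependency parser, and the resulting feature counts would serve as the "words" of an LDA-style topic model.

```python
from collections import Counter

# Hypothetical comment, already tokenized and lowercased.
tokens = ["the", "hero", "dies", "in", "the", "end"]

# Hypothetical hand-written parse of the same comment as
# (relation, head, dependent) triples; a real system would obtain
# these from a dependency parser rather than writing them by hand.
triples = [
    ("det", "hero", "the"),
    ("nsubj", "dies", "hero"),
    ("prep_in", "dies", "end"),
    ("det", "end", "the"),
]


def bow_features(tokens):
    """Bag-of-words view: each token is a feature, word order is discarded."""
    return Counter(tokens)


def dependency_features(triples):
    """Dependency view: each head/dependent pair is a feature, so the
    syntactic link between two words is kept instead of thrown away."""
    return Counter(f"{head}->{dep}" for _, head, dep in triples)


bow = bow_features(tokens)
dep = dependency_features(triples)

print(sorted(bow.items()))
print(sorted(dep.items()))
```

In the bag-of-words view, "hero" and "dies" are unrelated counts; in the dependency view, the single feature `dies->hero` records that the two words are syntactically linked, which is the kind of information the abstract proposes to exploit.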
