Cross-evaluation of metrics to estimate the significance of creative works

Significance: Whether for Hollywood movies or research papers, identifying works of great significance is imperative in a modern society overflowing with information. Through analysis of a network constructed from citations between films as referenced in the Internet Movie Database, we obtain several automated metrics for significance. We find that the best automated method can identify significant films, represented by selection to the US National Film Registry, at least as well as the aggregate rating of many experts and far better than the rating of a single expert. We hypothesize that these results may hold for other creative works.

In a world overflowing with creative works, it is useful to be able to filter out the unimportant works so that the significant ones can be identified and absorbed. An automated method could provide an objective approach for evaluating the significance of works on a universal scale. However, there have been few attempts at creating such a measure, and there are few “ground truths” for validating the effectiveness of potential metrics for significance. For movies, the US Library of Congress’s National Film Registry (NFR) contains American films that are “culturally, historically, or aesthetically significant” as chosen through a careful evaluation and deliberation process. By analyzing a network of citations between 15,425 United States-produced films procured from the Internet Movie Database (IMDb), we obtain several automated metrics for significance. The best of these metrics indicates a film’s presence in the NFR at least as well as, or better than, metrics based on aggregated expert opinions or large population surveys. Importantly, automated metrics can easily be applied to older films for which no other rating may be available. Our results may have implications for the evaluation of other creative works such as scientific research.
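The general idea of scoring films from a citation network and checking the scores against registry membership can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the metrics used, so citation count (in-degree) and a basic PageRank stand in for them, the area under the ROC curve stands in for the validation measure, and the five films, their citations, and their registry labels are invented toy data rather than IMDb or NFR records.

```python
from itertools import product

# Hypothetical toy data: (citing_film, cited_film) pairs. Not taken from IMDb.
FILMS = ["A", "B", "C", "D", "E"]
CITATIONS = [("D", "A"), ("D", "B"), ("C", "A"), ("E", "A"), ("E", "C"), ("B", "A")]
# Hypothetical ground truth: 1 if the film is in the registry, else 0.
IN_REGISTRY = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}


def citation_counts(films, citations):
    """In-degree: how many times each film is cited by another film."""
    counts = {f: 0 for f in films}
    for _citing, cited in citations:
        counts[cited] += 1
    return counts


def pagerank(films, citations, damping=0.85, iterations=100):
    """Basic PageRank by power iteration; rank flows from citing to cited film."""
    out_links = {f: [] for f in films}
    for citing, cited in citations:
        out_links[citing].append(cited)
    n = len(films)
    rank = {f: 1.0 / n for f in films}
    for _ in range(iterations):
        new_rank = {f: (1.0 - damping) / n for f in films}
        for f in films:
            # A film that cites nothing spreads its rank over all films.
            targets = out_links[f] or films
            share = damping * rank[f] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


def roc_auc(scores, labels):
    """Chance that a random registry film outscores a random non-registry film.

    Ties count as one half.
    """
    pos = [scores[f] for f in labels if labels[f] == 1]
    neg = [scores[f] for f in labels if labels[f] == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in product(pos, neg))
    return wins / (len(pos) * len(neg))


counts = citation_counts(FILMS, CITATIONS)
ranks = pagerank(FILMS, CITATIONS)
print("AUC, citation count:", roc_auc(counts, IN_REGISTRY))
print("AUC, PageRank:", roc_auc(ranks, IN_REGISTRY))
```

An AUC of 0.5 would mean the metric is no better than chance at separating registry films from the rest, and 1.0 would mean perfect separation. On real data the same comparison could be run for expert ratings, which is the kind of cross-evaluation the abstract describes.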
