Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data

The use of socially generated “big data” to access information about collective states of mind in human societies has become a new paradigm in the emerging field of computational social science. A natural application is predicting a society's reaction to a new product in terms of popularity and adoption rate. However, bridging the gap between “real-time monitoring” and “early prediction” remains a major challenge. Here we report on an effort to build a minimalistic predictive model for the financial success of movies based on the collective activity data of online users. We show that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the movie's entry in Wikipedia, the well-known online encyclopedia.
