Topic Time Series Analysis of Microblogs

IMA Journal of Applied Mathematics (2016) 81, 409–431
doi:10.1093/imamat/hxw025
Advance Access publication on 13 July 2016

Eric L. Lai
Department of Mathematics, UCI, Irvine, CA 92697, USA

Daniel Moyer
Department of Computer Science, University of Southern California, Los Angeles, CA 90033, USA and Department of Mathematics, UCLA, Los Angeles, CA 90095, USA

Baichuan Yuan
Department of Mathematics, Zhejiang University, Hangzhou 310027, China and Department of Mathematics, UCLA, Los Angeles, CA 90095, USA

Eric Fox
Department of Statistics, UCLA, Los Angeles, CA 90095, USA

Blake Hunter
Mathematical Sciences, Claremont McKenna College, Claremont, CA 91711, USA and Department of Mathematics, UCLA, Los Angeles, CA 90095, USA

Andrea L. Bertozzi*
Department of Mathematics, UCLA, Los Angeles, CA 90095, USA
*Corresponding author: bertozzi@math.ucla.edu

and

P. Jeffrey Brantingham
Department of Anthropology, UCLA, Los Angeles, CA 90095, USA

[Received on 24 April 2016]

© The authors 2016. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Social media data tend to cluster around events and themes. Local newsworthy events, sports team victories or defeats, abnormal weather patterns and globally trending topics all influence the content of online discussion. The automated discovery of these underlying themes from corpora of text is of interest to numerous academic fields as well as to law enforcement organizations and commercial users. One useful class of tools for such problems is topic models, which attempt to recover latent groups of word associations from the text. However, these topics may also exhibit patterns in both time and space. The recovery of such patterns complements the analysis of the text itself and in many cases provides additional context. In this work we describe two methods for mining interesting spatio-temporal dynamics and relations among topics: one that compares the topic distributions as histograms in space and time, and another that models topics over time as a temporal or spatio-temporal Hawkes process with exponential trigger functions. Both methods may be used to discover topics with abnormal distributions in space and time. The second method also allows for self-exciting topics and can recover intertopic relationships (excitation or inhibition) in both time and space. We apply these methods to a geo-tagged Twitter dataset and provide analysis and discussion of the results.

Keywords: mining complex datasets; spatial and temporal analysis; topic modeling; cluster analysis.
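
To make the second method mentioned in the abstract more concrete, the following is a minimal sketch of the conditional intensity of a multivariate temporal Hawkes process with an exponential trigger function. It is an illustration under stated assumptions, not the authors' implementation: the function name `hawkes_intensities`, the parameter names `mu`, `alpha` and `omega`, and the particular parametrization (a normalized kernel `omega * exp(-omega * dt)` shared by all topic pairs) are hypothetical choices made here. The sketch covers excitation only; modeling inhibition would need additional handling to keep the intensity non-negative.

```python
import math


def hawkes_intensities(t, events, mu, alpha, omega):
    """Return the conditional intensity lambda_u(t) for every topic u.

    Assumed (hypothetical) parametrization:
        lambda_u(t) = mu[u]
                      + sum over events (t_i, v) with t_i < t of
                        alpha[u][v] * omega * exp(-omega * (t - t_i))

    events : list of (time, topic_index) pairs
    mu     : background rate per topic
    alpha  : alpha[u][v] is the excitation of topic u by one event of topic v
    omega  : decay rate of the exponential trigger
    """
    rates = list(mu)  # start from the background rates
    for t_i, v in events:
        if t_i >= t:
            continue  # only strictly earlier events contribute
        kernel = omega * math.exp(-omega * (t - t_i))
        for u in range(len(mu)):
            rates[u] += alpha[u][v] * kernel
    return rates


if __name__ == "__main__":
    # Two topics; a single topic-0 event at time 0 raises both intensities.
    mu = [0.1, 0.2]
    alpha = [[0.5, 0.0],
             [0.3, 0.0]]
    print(hawkes_intensities(1.0, [(0.0, 0)], mu, alpha, omega=1.0))
```

In this parametrization a diagonal entry `alpha[u][u]` describes self-excitation of a topic, while an off-diagonal entry `alpha[u][v]` describes how strongly events in topic `v` raise the rate of topic `u`.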
