Fast mining and forecasting of complex time-stamped events

Given huge collections of time-evolving events such as web-click logs, which consist of multiple attributes (e.g., URL, userID, timestamp), how do we find patterns and trends? How do we capture daily patterns and forecast future events? We need two properties: (a) effectiveness, that is, the patterns should help us understand the data, discover groups, and enable forecasting, and (b) scalability, that is, the method should scale linearly with the data size. We introduce TriMine, which performs three-way mining over all three attributes, namely, URLs, users, and time. Specifically, TriMine simultaneously discovers hidden topics, groups of URLs, and groups of users. Thanks to its concise but effective summarization, it makes possible the most challenging and important task, namely, forecasting future events. Extensive experiments on real datasets demonstrate that TriMine discovers meaningful topics and makes long-range forecasts, which are notoriously difficult to achieve. In fact, TriMine consistently outperforms state-of-the-art methods in terms of accuracy and execution speed (up to 74x faster).
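To make the three-attribute input concrete, the sketch below shows one way such a click log could be arranged as a sparse three-way count tensor indexed by URL, user, and time bucket. It is not TriMine itself and performs no topic discovery or forecasting. The function name `build_event_tensor`, the `ticks_per_bucket` parameter, the integer timestamps, and the toy log are all assumptions made for illustration.

```python
from collections import Counter


def build_event_tensor(events, ticks_per_bucket):
    """Count events per (URL, user, time bucket) cell.

    events: iterable of (url, user_id, timestamp) triples, where the
        timestamp is a non-negative integer (an assumption of this sketch).
    ticks_per_bucket: width of one time bucket, in timestamp units.

    Returns a sparse tensor as a Counter keyed by
    (url_index, user_index, bucket), plus the URL and user index maps.
    """
    url_index, user_index = {}, {}
    tensor = Counter()
    for url, user, ts in events:
        # Assign each new URL / user the next free integer index.
        u = url_index.setdefault(url, len(url_index))
        v = user_index.setdefault(user, len(user_index))
        t = ts // ticks_per_bucket
        tensor[(u, v, t)] += 1
    return tensor, url_index, user_index


# Hypothetical toy log: (URL, userID, timestamp in seconds).
events = [
    ("news.example/a", "alice", 0),
    ("news.example/a", "alice", 30),
    ("shop.example/b", "bob", 65),
    ("news.example/a", "bob", 70),
]
tensor, url_index, user_index = build_event_tensor(events, ticks_per_bucket=60)
```

The pass touches each event once, so building this representation costs time linear in the number of events, in keeping with the scalability property stated above. Only non-empty cells are stored, which suits logs in which most (URL, user, time) combinations never occur.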
