Towards Longitudinal Analytics on Social Media Data

Interest in longitudinal analytics on social media data is increasing. Unlike real-time search, which considers only recently generated social media data, longitudinal analytics takes a time interval into account and considers the temporal popularity of social media data within that interval. We study a fundamental functionality in longitudinal analytics: top-k temporal keyword (TkTK) querying. A TkTK query takes as input a set of query keywords and an interval, and returns the top-k most significant social items, e.g., tweets, where the significance of a social item is defined as a combination of its textual relevance and its temporal popularity. We model social media data as a forest of linkage trees along the time dimension, which captures the propagation processes, e.g., replies and forwards, among different social items. Based on the forest, we model the temporal popularity of a social item across time as a popularity time series. We design two indexing structures that index social items' popularity time series and textual content in a holistic manner: the temporal popularity inverted index (TPII) and the log-structured merge octree (LSMO). Empirical studies with two substantial social media data sets offer insight into the design properties of the indexes and confirm that LSMO enables both efficient query processing and efficient index updates.
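To make the TkTK query semantics concrete, the following is a minimal brute-force sketch in Python. It is not the paper's method: it scans every item and uses neither TPII nor LSMO. The scoring details are assumptions made for illustration only, since the abstract does not give the formula: textual relevance is taken as the fraction of query keywords present in the item, temporal popularity as the sum of the item's popularity time series over the query interval (normalized by the maximum among candidates), and the two are combined linearly with a hypothetical weight `alpha`. The names `SocialItem`, `textual_relevance`, `temporal_popularity`, and `tktk_query` are likewise hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SocialItem:
    item_id: str
    text: str
    # Popularity time series: time step -> number of replies/forwards
    # observed at that step (assumed discretization, for illustration).
    popularity: Dict[int, int] = field(default_factory=dict)


def textual_relevance(item: SocialItem, keywords: List[str]) -> float:
    """Assumed relevance: fraction of query keywords found in the item text."""
    if not keywords:
        return 0.0
    tokens = set(item.text.lower().split())
    hits = sum(1 for kw in keywords if kw.lower() in tokens)
    return hits / len(keywords)


def temporal_popularity(item: SocialItem, start: int, end: int) -> int:
    """Assumed popularity: sum of the time series over [start, end]."""
    return sum(c for t, c in item.popularity.items() if start <= t <= end)


def tktk_query(items: List[SocialItem], keywords: List[str],
               start: int, end: int, k: int,
               alpha: float = 0.5) -> List[Tuple[float, str]]:
    """Brute-force top-k: score every matching item, keep the k best."""
    candidates = [it for it in items if textual_relevance(it, keywords) > 0]
    max_pop = max(
        (temporal_popularity(it, start, end) for it in candidates), default=0
    )
    scored = []
    for it in candidates:
        rel = textual_relevance(it, keywords)
        pop = temporal_popularity(it, start, end)
        norm_pop = pop / max_pop if max_pop else 0.0
        # Assumed linear combination of relevance and popularity.
        scored.append((alpha * rel + (1 - alpha) * norm_pop, it.item_id))
    return heapq.nlargest(k, scored)


items = [
    SocialItem("a", "longitudinal analytics on tweets", {1: 5, 2: 3, 10: 100}),
    SocialItem("b", "tweets about analytics", {1: 1, 2: 1}),
    SocialItem("c", "unrelated cooking post", {1: 50}),
]
result = tktk_query(items, ["analytics", "tweets"], start=1, end=3, k=2)
print(result)
```

In this toy data, item `c` is popular in the interval but matches no keyword, so it is excluded, while the spike in item `a` at time step 10 lies outside the interval [1, 3] and does not contribute to its score. A scan like this costs time linear in the number of items per query, which is the cost that an index over both text and popularity time series is meant to avoid.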
