Top-k frequent items and item frequency tracking over sliding windows of any size

Abstract: Many big data applications today require querying highly dynamic, large-scale data streams to find the top-k most frequent items in the most recent window of a specified size at a specific time. This is a challenging problem. We propose a novel approach called Floating Top-k. Our algorithm does not need to explicitly maintain any item counts over time or handle count updates when items enter and expire. Instead, it uses only a small data structure to retrieve the top-k items dynamically in a window of any specified size up to an upper bound. We prove that the space and time costs of Floating Top-k grow only logarithmically with the window size, whereas those of previous methods grow linearly. Our comprehensive experiments on three real-world datasets show that Floating Top-k not only provides accuracy guarantees but also has a memory footprint two to three orders of magnitude smaller and runs one to two orders of magnitude faster than previous approaches. Hence, Floating Top-k is both effective and scalable, significantly outperforming competing methods. In addition, we devise a concise and efficient solution called Progressive Trend Model to address the related problem of tracking the frequency of selected items. It produces models 20 to 30 times more concise than those of previous methods while maintaining the same accuracy and efficiency.
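To make the query concrete, the sketch below shows an exact, brute-force baseline for "top-k items in the most recent window of a given size, up to an upper bound". It is not Floating Top-k: it buffers every item, so its space grows linearly with the maximum window size, which is the cost the abstract says the proposed method avoids. The class name `ExactWindowTopK` and its methods are hypothetical, chosen only for illustration.

```python
from collections import Counter, deque


class ExactWindowTopK:
    """Brute-force baseline (hypothetical name): keeps the last
    `max_window` items and counts them on demand. Linear space,
    unlike the logarithmic cost claimed for Floating Top-k."""

    def __init__(self, max_window):
        if max_window <= 0:
            raise ValueError("max_window must be positive")
        # The deque silently drops the oldest item once it is full.
        self.buffer = deque(maxlen=max_window)

    def add(self, item):
        """Append the next stream item."""
        self.buffer.append(item)

    def top_k(self, k, window):
        """Return the k most frequent (item, count) pairs among the
        most recent `window` items, for any window up to the bound."""
        if not 1 <= window <= self.buffer.maxlen:
            raise ValueError("window must be in [1, max_window]")
        recent = list(self.buffer)[-window:]
        return Counter(recent).most_common(k)


tracker = ExactWindowTopK(max_window=10)
for item in ["a", "b", "a", "c", "b", "a", "d", "d", "d", "d"]:
    tracker.add(item)

small = tracker.top_k(k=1, window=4)   # only the last four items
large = tracker.top_k(k=2, window=10)  # the whole buffer
```

The same buffer answers queries for any window size up to `max_window`, which mirrors the "any specified size within an upper bound" requirement, but each query rescans the window and the buffer stores every item.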
