Top-k frequent items and item frequency tracking over sliding windows of any size

Abstract: Many big data applications today must query highly dynamic, large-scale data streams to find the top-k most frequent items in the most recent window of a specified size at a specific time. This is a challenging problem. We propose a novel approach called Floating Top-k. Our algorithm does not need to explicitly maintain item counts over time or handle count updates when items enter and expire. Succinctly, we use only a small data structure to retrieve the top-k items dynamically in a window of any specified size up to an upper bound. We prove that the space and time costs of Floating Top-k grow only logarithmically with the window size, rather than linearly as in previous methods. Our comprehensive experiments on three real-world datasets show that Floating Top-k not only provides accuracy guarantees but also has a memory footprint two to three orders of magnitude smaller and runs one to two orders of magnitude faster than previous approaches. Hence, Floating Top-k is both effective and scalable, significantly outperforming competing methods. In addition, we devise a concise and efficient solution called Progressive Trend Model for the related problem of tracking the frequency of selected items; it improves on previous methods by a factor of 20 to 30 in model conciseness while maintaining the same accuracy and efficiency.
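To make the query in the abstract concrete, the following is a minimal sketch of a naive exact baseline, not the Floating Top-k algorithm itself. It stores every item in the window and keeps explicit per-item counts, so its space grows linearly with the window size, which is the kind of cost the abstract contrasts against. The class name `ExactSlidingTopK` and its methods are hypothetical names chosen for illustration.

```python
from collections import Counter, deque


class ExactSlidingTopK:
    """Naive exact baseline (NOT Floating Top-k): keeps every item in the
    window plus explicit counts, so space is linear in the window size."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.items = deque()     # items currently inside the window, oldest first
        self.counts = Counter()  # explicit per-item counts for the window

    def add(self, item):
        """Admit a new item and expire the oldest one if the window is full."""
        self.items.append(item)
        self.counts[item] += 1
        if len(self.items) > self.window:
            old = self.items.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]  # drop items that left the window entirely

    def top_k(self, k: int):
        """Return up to k (item, count) pairs, most frequent first."""
        return self.counts.most_common(k)


# Usage: the 5 most recent items of "aabbbcab" are b, b, c, a, b.
tracker = ExactSlidingTopK(window=5)
for symbol in "aabbbcab":
    tracker.add(symbol)
print(tracker.top_k(1))
```

Note that this baseline answers queries for only one fixed window size and updates a count on every arrival and expiration. According to the abstract, Floating Top-k avoids both the explicit counts and the fixed window size.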