What's hot and what's not: tracking most frequent items dynamically

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are also used as simple outliers in data mining and in anomaly detection in networking applications.

We present a new algorithm for dynamically determining the hot items at any time in a relation that is undergoing deletions as well as insertions. Our algorithm maintains a small-space data structure that monitors the transactions on the relation and, when required, quickly outputs all hot items without rescanning the relation in the database. With user-specified probability, it reports all hot items. Our algorithm relies on the idea of "group testing", is simple to implement, and has provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithm is remarkably accurate in dynamically tracking the hot items, independent of the rate of insertions and deletions.
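To make the "group testing" idea concrete, here is a minimal sketch of the simplest case only: a single group that tracks a majority item (one occurring in more than half of the current tuples) under both insertions and deletions. It is an illustration under stated assumptions, not the paper's full algorithm. The class name `MajorityTracker`, the 32-bit integer item identifiers, and the method names are all hypothetical. Each counter acts as one "test": it counts the current items whose identifier has a given bit set, so a majority item can be read off bit by bit without rescanning the data.

```python
class MajorityTracker:
    """Hypothetical single-group sketch: one total counter plus one counter per bit."""

    def __init__(self, bits=32):
        # Assumption: items are non-negative integers that fit in `bits` bits.
        self.bits = bits
        self.total = 0                 # number of items currently present
        self.bit_counts = [0] * bits   # bit_counts[j] = items present with bit j set

    def _update(self, item, delta):
        # An insertion adds +1 and a deletion adds -1 to the same counters,
        # which is why deletions need no rescan of the relation.
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.bit_counts[j] += delta

    def insert(self, item):
        self._update(item, 1)

    def delete(self, item):
        # Assumption: only items that are currently present get deleted.
        self._update(item, -1)

    def candidate(self):
        """Return the majority item if one exists; otherwise the result is arbitrary."""
        if self.total <= 0:
            return None
        item = 0
        for j in range(self.bits):
            # A majority item forces every bit test to agree with its own bit.
            if 2 * self.bit_counts[j] > self.total:
                item |= 1 << j
        return item


tracker = MajorityTracker()
for x in [5, 5, 5, 7, 9]:
    tracker.insert(x)
first = tracker.candidate()    # 5 holds 3 of the 5 items

tracker.delete(5)
tracker.delete(5)
tracker.insert(9)
tracker.insert(9)
second = tracker.candidate()   # 9 now holds 3 of the 5 items
```

The sketch uses space proportional to the number of identifier bits rather than the number of tuples. When no majority item exists the returned candidate carries no guarantee, so a caller would need a separate check. Finding every item above a smaller threshold, with the user-specified success probability mentioned above, requires more than this single group.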
